Spatial adaptation in heteroscedastic regression: 
Propagation approach 

Nora Serdyukova 
May 13, 2011 

Abstract 

The paper concerns the problem of pointwise adaptive estimation in regression when the noise is heteroscedastic and incorrectly known. We use the method of local approximation, which includes local polynomial smoothing as a particular case. Specifically, the model with unknown mean and variance is approximated by a local linear model with an incorrectly specified covariance matrix.

The adaptive choice of the degree of localization in this case can be understood as the choice of an appropriate parametric model from a given collection. For the selection from the family of models we employ the FLL technique, based on Lepski's method, recently suggested in Katkovnik and Spokoiny (2008). The problem of the choice of certain parameters in this type of procedures was addressed in Spokoiny and Vial (2009). The authors called their approach to the calibration of the parameters "propagation". We develop and justify the methodology for the heteroscedastic case in the presence of noise misspecification. The analysis shows that the adaptive procedure allows a misspecification of the covariance matrix with a relative error of order $(\log n)^{-1}$, where $n$ is the sample size.

1 Introduction 

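Consider a regression model

$$Y = f + \Sigma_0^{1/2}\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I_n) \tag{1.1}$$

with response vector $Y \in \mathbb{R}^n$ and an unknown covariance matrix $\Sigma_0 = \operatorname{diag}(\sigma_{0,1}^2, \ldots, \sigma_{0,n}^2)$. This model can be written as

$$Y_i = f(X_i) + \sigma_{0,i}\,\varepsilon_i, \qquad i = 1, \ldots, n \tag{1.2}$$

with design points $X_i \in \mathcal{X} \subset \mathbb{R}^d$. Given a point $x \in \mathcal{X}$, the target of estimation is the value of the regression function $f(x)$ at the point $x$.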

The idea is to replace model (1.2) by a local parametric model

$$Y_i = f_\theta(X_i) + \sigma_i\,\varepsilon_i, \qquad i : X_i \in U_h(x), \tag{1.3}$$

where $U_h(x) = \{t : \|t - x\| \le h\}$ and $\theta \in \Theta \subset \mathbb{R}^p$ is an unknown finite-dimensional parameter. Then, employing one of the well-developed parametric methods, we can estimate $\theta$ by $\hat\theta = \hat\theta(Y_1, \ldots, Y_n, x)$, and then use the estimator $f_{\hat\theta(Y_1, \ldots, Y_n)}(x)$ based on the observations from the "true" model (1.2) for estimation of $f(x)$. Therefore we have to choose the local model (the collection of estimators $\{f_\theta(\cdot),\ \theta \in \Theta\}$) and the appropriate degree of locality $h$. This method of local approximation originated from [M], [15], [16], [17]. In what follows we shall consider approximation by local linear models of the following type:

$$Y_i = \Psi_i^\top\theta + \sigma_i\,\varepsilon_i, \qquad i : X_i \in U_h(x), \tag{1.4}$$

where $\Psi_i = \Psi(X_i - x) = (\psi_1(X_i - x), \ldots, \psi_p(X_i - x))^\top$ is a vector of basis functions $\{\psi_j(\cdot)\}$ which are already fixed. The main issue then is the choice of the appropriate bandwidth $h$ such that the estimator

$$f_{\hat\theta_h}(x) = \Psi(0)^\top\hat\theta_h(x) = \sum_{j=1}^p \hat\theta_h^{(j)}(x)\,\psi_j(0) \tag{1.5}$$

built on the base of the localized data is a relevant estimator for $f(x)$. For this purpose the bandwidth selection should be done in a data-driven way, and the adaptive selection from the family $\{f_{\hat\theta_h}(\cdot)\}_{h>0}$ for a fixed basis is equivalent to the adaptive choice of the bandwidth. Notice also that the coefficients $\theta^{(j)}(x)$, as well as their estimators, depend on $x$ and should be calculated for every particular point of interest $x$. On the other hand, the localization reduces the influence of the choice of the functions $\{\psi_j(\cdot)\}$, allowing one to use simple collections.

Moreover, in our set-up the covariance matrix $\Sigma_0$ is not assumed to be known exactly, and the approximate model used instead of the true one reads as follows:

$$Y = \Psi^\top\theta + \Sigma^{1/2}\varepsilon, \tag{1.6}$$

where $\Psi = (\Psi_1, \ldots, \Psi_n)$ is a $p \times n$ design matrix and $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)$, $\min_i\{\sigma_i^2\} > 0$, is available to a statistician. Thus the model is misspecified in two places: in the form of the regression function and in the error distribution.

The proposed approach includes the important class of polynomial regressions, see [B], [T7], [H], [28]. For example, in the univariate case $x \in \mathbb{R}$, due to the Taylor theorem, the approximation of the unknown function $f(t)$ for $t$ close to $x$ can be written in the following form: $f_\theta(t) = \theta^{(0)} + \theta^{(1)}(t - x) + \ldots + \theta^{(p-1)}(t - x)^{p-1}/(p-1)!$ with the parameter $\theta = (\theta^{(0)}, \theta^{(1)}, \ldots, \theta^{(p-1)})^\top$ corresponding to the values of $f(\cdot)$ and its derivatives at the point $x$, if they exist. The matrix $\Psi$ then consists of the columns $\Psi_i = (1, X_i - x, \ldots, (X_i - x)^{p-1}/(p-1)!)^\top$ and corresponds to the well known polynomial smoothing. If the regression function is sufficiently smooth, then for any $t$ close to $x$, up to a remainder term, $f(t) \approx f_\theta(t)$, and the estimate of $f(x)$ at the point $x$ is given by $\hat f(x) = f_{\hat\theta(x)}(x) = \hat\theta^{(0)}$. For more details on local polynomial estimation see [5].

The local constant fit at a given point $x \in \mathbb{R}$ is covered as well. In this case the design matrix is $\Psi = (1, \ldots, 1)$, and $f_\theta(X_i) = \theta = \theta^{(0)} = f_\theta(x)$, $i = 1, \ldots, n$. This type of approximation in our set-up with known constant noise is treated in [18] and [27].

Nonparametric estimation in heteroscedastic regression under $L_2$ losses was studied in [T^ , [13] and the series of papers [5] , [5] , [TU] . For estimation of the mean with $L_2$-risk in the Gaussian homoscedastic model with unknown variance, penalties allowing one to deal with the complexity of such a collection of models were proposed in [2]. However, the problem of "local model selection" addressed in the present paper is quite different from model selection in the sense of [J and [25], related to estimation with global risk. Minimax pointwise estimation in heteroscedastic regression is the focus of [3].

2 Estimation procedure



2.1 Local parametric estimation 

We shall perform the adaptive selection from a collection of $K$ estimators corresponding to model (1.4) with different sizes $h$ of the neighborhood $U_h(x)$. Fix a point $x \in \mathbb{R}^d$ as a center of localization and a basis $\{\psi_j\}$. Let the localizing operator be identified with the corresponding matrix. For the next nonparametric step we will need a sequence of nested windows. Thus for every $x$ the sequence of localizing schemes (scales) $\mathcal{W}_k(x)$, $k = 1, \ldots, K$, is given by the matrices $\mathcal{W}_k(x) = \operatorname{diag}(w_{k,1}(x), \ldots, w_{k,n}(x))$, where the weights $w_{k,i}(x) \in [0, 1]$ can be understood, for instance, as smoothing kernels $w_{k,i}(x) = W((X_i - x)h_k^{-1})$. Let a particular localizing function $w_{(\cdot)}(x)$ be fixed; the aim is to choose, on the basis of the available data, the index $k$ of the optimal bandwidth $h_k$. To simplify the notation we sometimes suppress the dependence on the reference point $x$. Denote by

$$W_k := \Sigma^{-1/2}\,\mathcal{W}_k\,\Sigma^{-1/2} = \operatorname{diag}\Big(\frac{w_{k,1}}{\sigma_1^2}, \ldots, \frac{w_{k,n}}{\sigma_n^2}\Big), \qquad k = 1, \ldots, K. \tag{2.1}$$

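Let $\Theta$ be a compact subset of $\mathbb{R}^p$. Inside any "window" given by $\mathcal{W}_k(x)$, $k = 1, \ldots, K$, we calculate the quasi-maximum likelihood estimator (QMLE) $\tilde\theta_k = \tilde\theta_k(x) = (\tilde\theta_k^{(1)}(x), \ldots, \tilde\theta_k^{(p)}(x))^\top$ of $\theta$ defined as

$$\tilde\theta_k := \operatorname*{argmax}_{\theta \in \Theta} L(W_k, \theta), \tag{2.2}$$

where

$$L(W_k, \theta) = -\frac{1}{2}(Y - \Psi^\top\theta)^\top W_k (Y - \Psi^\top\theta) + R = -\frac{1}{2}\sum_{i=1}^n |Y_i - \Psi_i^\top\theta|^2\,\frac{w_{k,i}}{\sigma_i^2} + R, \tag{2.3}$$

$R$ stands for the terms not depending on $\theta$, and

$$\Psi_i = \Psi(X_i - x) = (\psi_1(X_i - x), \ldots, \psi_p(X_i - x))^\top.$$

If the $p \times p$ matrix $B_k = B_k(x)$ given by

$$B_k := \Psi W_k \Psi^\top = \sum_{i=1}^n \Psi_i\Psi_i^\top\,\frac{w_{k,i}}{\sigma_i^2} \tag{2.4}$$

is positive definite ($B_k \succ 0$), then

$$\tilde\theta_k = B_k^{-1}\Psi W_k Y = B_k^{-1}\sum_{i=1}^n \Psi_i Y_i\,\frac{w_{k,i}}{\sigma_i^2}. \tag{2.5}$$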



Recall that in the case of the polynomial basis the estimator $\tilde\theta_k(x)$ is a local polynomial estimator of $\theta(x)$ corresponding to the $k$th scale. In what follows we assume that $n > p$, and that $\det B_k > 0$ for any $k = 1, \ldots, K$. Because $p = \operatorname{rank}(B_k) \le \min\{p, \operatorname{rank}(\mathcal{W}_k(x))\}$, this requires the following conditions on the design matrix $\Psi$ and the minimal localizing scheme $\mathcal{W}_1(x)$:

(A1) The $p \times n$ design matrix $\Psi$ is supposed to have full row rank, i.e.,

$$\dim \mathcal{C}(\Psi^\top) = \dim \mathcal{C}(\Psi^\top\Psi) = p.$$

(A2) The smallest localizing scheme $\mathcal{W}_1(x)$ is chosen to contain at least $p$ design points such that $w_{1,i}(x) > 0$, i.e., $p \le \#\{i : w_{1,i}(x) > 0\}$.

Assumption (A2) is automatically fulfilled in practice, since, for example, it means that for the local constant fit we need at least one observation, and so on. Usually it is intrinsically assumed that, starting from the smallest window, at every step of the procedure every new window contains at least $p$ new design points.

The formulas (2.2) give a sequence of estimators $\{\tilde\theta_k(x)\}_{k=1}^K$. It was noticed in [1 that in the case when the true data distribution is unknown the QMLE is a natural estimator for the parameter maximizing the expected log-likelihood. That is, for every $k = 1, \ldots, K$, the estimator $\tilde\theta_k(x)$ can be considered as an estimator of

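$$\begin{aligned}
\theta_k^*(x) &:= \operatorname*{argmax}_{\theta \in \Theta} \mathbb{E}L(W_k, \theta) && (2.6)\\
&= \operatorname*{argmin}_{\theta \in \Theta} (f - \Psi^\top\theta)^\top W_k (f - \Psi^\top\theta)\\
&= B_k^{-1}\Psi W_k f = B_k^{-1}\sum_{i=1}^n \Psi_i f(X_i)\,\frac{w_{k,i}}{\sigma_i^2}. && (2.7)
\end{aligned}$$

Recall that we do not assume $f = \Psi^\top\theta$ even locally. It is known that in the presence of a model misspecification, for every $k$ the QMLE $\tilde\theta_k$ is a strongly consistent estimator for $\theta_k^*(x)$, which also is the minimizer of the localized Kullback-Leibler [19] information criterion:

$$\theta_k^*(x) = \operatorname*{argmin}_{\theta \in \Theta} \sum_{i=1}^n \mathrm{KL}\big(\mathcal{N}(f(X_i), \sigma_i^2), \mathcal{N}(\Psi_i^\top\theta, \sigma_i^2)\big)\,w_{k,i}(x) = \operatorname*{argmin}_{\theta \in \Theta} \sum_{i=1}^n |f(X_i) - \Psi_i^\top\theta|^2\,\frac{w_{k,i}(x)}{2\sigma_i^2}$$

with $\mathrm{KL}(P, P_\theta) := \mathbb{E}_P[\log(dP/dP_\theta)]$. For the properties of the Kullback-Leibler divergence see, for example, [28].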

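It follows from the above definition of $\theta_k^*(x)$ and from (2.2) that the QMLE $\tilde\theta_k$ admits a decomposition into deterministic and stochastic parts:

$$\tilde\theta_k = B_k^{-1}\Psi W_k (f + \Sigma_0^{1/2}\varepsilon) = \theta_k^* + B_k^{-1}\Psi W_k \Sigma_0^{1/2}\varepsilon, \tag{2.8}$$
$$\mathbb{E}\tilde\theta_k = \theta_k^*, \tag{2.9}$$

where $\varepsilon \sim \mathcal{N}(0, I_n)$. Notice that if $f = \Psi^\top\theta$, then $\theta_k^* = \theta$ for any $k$, and the classical parametric set-up takes place.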
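
For the local constant fit ($p = 1$, $\Psi_i \equiv 1$) the weighted least squares formulas for $B_k$ and the QMLE reduce to weighted sums. The following is a minimal Python sketch under stated assumptions: scalar design points, a rectangular kernel $w_{k,i} = 1\{|X_i - x| \le h_k\}$, and an illustrative function name `local_constant_qmle` that does not come from the paper.

```python
def local_constant_qmle(y, design, sigma2, x0, h):
    """Local constant QMLE at x0 for bandwidth h (sketch, p = 1).

    y      : observations Y_i
    design : scalar design points X_i
    sigma2 : model variances sigma_i^2 (possibly misspecified)
    Returns (theta_k, B_k).
    """
    # Assumption: rectangular kernel, w_i = 1 if |X_i - x0| <= h, else 0.
    weights = [1.0 if abs(xi - x0) <= h else 0.0 for xi in design]
    # B_k = sum_i w_i / sigma_i^2  (a 1 x 1 "matrix" when p = 1)
    b_k = sum(w / s2 for w, s2 in zip(weights, sigma2))
    if b_k <= 0.0:
        raise ValueError("empty window: at least one design point is needed")
    # theta_k = B_k^{-1} sum_i Y_i w_i / sigma_i^2
    theta_k = sum(w * yi / s2 for w, yi, s2 in zip(weights, y, sigma2)) / b_k
    return theta_k, b_k


# Illustrative data: three points fall into the window around x0 = 0.1.
theta, b = local_constant_qmle(
    y=[1.0, 3.0, 2.0, 10.0],
    design=[0.0, 0.1, 0.2, 0.9],
    sigma2=[1.0, 1.0, 2.0, 1.0],
    x0=0.1,
    h=0.15,
)
```

Observations with a larger model variance are downweighted, which is the only place where the (possibly incorrect) $\sigma_i^2$ enter the estimator.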

2.2 Adaptive bandwidth selection 

Let a point $x \in \mathcal{X} \subset \mathbb{R}^d$, a basis $\{\psi_j\}$ and the method of localization $w_{(\cdot)}(x)$ be fixed. The crucial assumption for the procedure under consideration to work is that the localizing schemes (scales) $\mathcal{W}_k(x) = \operatorname{diag}(w_{k,1}, \ldots, w_{k,n})$ are nested. One can say that the localizing schemes are nested in the sense that for the corresponding matrices the following ordering condition is fulfilled:

(A3) For any fixed $x$ and the method of localization $w_{(\cdot)}(x)$ the following relation holds:

$$\mathcal{W}_1(x) \le \ldots \le \mathcal{W}_k(x) \le \ldots \le \mathcal{W}_K(x).$$

For kernel smoothing this condition means the following. Let the sequence of bandwidths $\{h_k\}$ be ordered from the smallest to the largest one, i.e., $h_1 < \ldots < h_K$, and let $\mathcal{W}_k(x) = \operatorname{diag}(w_{k,1}, \ldots, w_{k,n})$ be the localizing matrix corresponding to the bandwidth $h_k$. Here the weights $w_{k,i} = w_{k,i}(x) = W((X_i - x)/h_k) \in [0, 1]$ are nonnegative functions such that $W(u/h_l) \le W(u/h_k)$ for any $0 < h_l < h_k < 1$, and $W(u) \to 0$ as $|u| \to \infty$, or $W$ is even compactly supported.

Recall that, given $x \in \mathcal{X}$, a basis $\{\psi_j\}$, and the method of localization $w_{(\cdot)}(x)$, we look for the estimator $f_{\hat\theta}(x)$ of $f(x)$ having form (1.5), where the coefficients $\hat\theta^{(j)}(x)$ are the components of the estimator

$$\hat\theta(x) = (\hat\theta^{(1)}(x), \ldots, \hat\theta^{(p)}(x))^\top, \tag{2.10}$$

corresponding to the adaptive choice of the index $\hat k \in \{1, \ldots, K\}$, i.e., to the choice of the degree of localization.

The selection of $\hat\theta(x)$ from $\{\tilde\theta_k(x)\}$, $k = 1, \ldots, K$, can be done by applying the Lepski [20] method to the comparison of the maximized log-likelihoods $L(W_k, \tilde\theta_k)$. This is the idea of the FLL technique suggested in [T5]. More precisely, to describe the test statistic, define for any $\theta, \theta' \in \Theta$ the corresponding log-likelihood ratio:

$$L(W_k, \theta, \theta') := L(W_k, \theta) - L(W_k, \theta'). \tag{2.11}$$

Then, using the approach suggested in [18], for every $l = 1, \ldots, K$, the fitted log-likelihood (FLL) ratio is defined as follows:

$$L(W_l, \tilde\theta_l, \theta') := \max_{\theta \in \Theta} L(W_l, \theta, \theta').$$

By Theorem 4.2, for any $l$ and $\theta$, the FLL is a quadratic form:

$$2L(W_l, \tilde\theta_l, \theta) = (\tilde\theta_l - \theta)^\top B_l (\tilde\theta_l - \theta).$$

This prompts the use, see [18], of the FLL-statistics:

$$T_{lk} := 2L(W_l, \tilde\theta_l, \tilde\theta_k) = (\tilde\theta_l - \tilde\theta_k)^\top B_l (\tilde\theta_l - \tilde\theta_k), \qquad l < k. \tag{2.12}$$

In the algorithm the smallest bandwidth, corresponding to $k = 1$, is always accepted; then the adaptive index $\hat k$ is selected by Lepski's selection rule with the FLL test statistics $\{T_{lm}\}$, $1 \le l < m \le K$:

$$\hat k = \max\{k \le K : T_{lm} \le \mathfrak{z}_l,\ l < m \le k\}. \tag{2.13}$$

Finally we set $\hat\theta = \tilde\theta_{\hat k}$.
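
The selection rule can be sketched in Python for the scalar case $p = 1$, where the FLL statistic is $T_{lm} = B_l(\tilde\theta_l - \tilde\theta_m)^2$. The function names and the numerical inputs below are hypothetical; the critical values are simply passed in, since their calibration is a separate step.

```python
def fll_statistic(theta_l, theta_m, b_l):
    """FLL statistic T_lm = B_l * (theta_l - theta_m)^2 for p = 1."""
    return b_l * (theta_l - theta_m) ** 2


def lepski_index(thetas, bs, critical_values):
    """Return the adaptive index k_hat (1-based).

    thetas[k-1], bs[k-1]  : estimate and B_k at scale k = 1, ..., K
    critical_values[l-1]  : threshold z_l for l = 1, ..., K-1
    """
    k_hat = 1  # the smallest scale is always accepted
    for k in range(2, len(thetas) + 1):
        # Scale k is accepted only if it agrees with every smaller scale l < k.
        accepted = all(
            fll_statistic(thetas[l - 1], thetas[k - 1], bs[l - 1])
            <= critical_values[l - 1]
            for l in range(1, k)
        )
        if not accepted:
            break  # pairs with m <= k-1 were already checked at earlier steps
        k_hat = k
    return k_hat


# Illustrative run: the fourth estimate deviates strongly and is rejected.
k_hat = lepski_index(
    thetas=[1.0, 1.1, 1.05, 3.0],
    bs=[1.0, 2.0, 4.0, 8.0],
    critical_values=[1.0, 1.0, 1.0],
)
```

The sequential loop is equivalent to taking the maximum in the rule: once some pair $(l, m)$ with $m \le k$ fails, no larger index can be accepted.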

The procedure (2.13) involves parameters $\mathfrak{z}_1, \ldots, \mathfrak{z}_{K-1}$ related to the large deviations of $\{T_{lm}\}$, $1 \le l < m \le K$. Like the classical Lepski procedure, (2.13) controls the risk of estimators in the case of dominating bias. The opposite case, when the bias is negligible w.r.t. the noise, is usually handled by employing advanced empirical process techniques, which, however, sometimes provide constants far from optimal. Notice also that the Wilks-type Theorem 4.3 below gives the bound for the expected fitted log-likelihood ratio:

$$\mathbb{E}|2L(W_k, \tilde\theta_k, \theta_k^*)|^r \le C(p, r), \tag{2.14}$$

where the constant $C(p, r)$ does not depend on the degree of localization and is given by the $r$th moment of the $\chi^2$ distribution with $p$ degrees of freedom:

$$C(p, r) = \mathbb{E}|\chi_p^2|^r = 2^r\,\Gamma(r + p/2)/\Gamma(p/2). \tag{2.15}$$
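
The constant $C(p, r) = 2^r\,\Gamma(r + p/2)/\Gamma(p/2)$ is straightforward to evaluate numerically. A short Python sketch using only the standard library (the function name `chi2_moment` is an illustrative choice):

```python
import math


def chi2_moment(p, r):
    """r-th moment of the chi-squared distribution with p degrees of freedom:
    E|chi2_p|^r = 2^r * Gamma(r + p/2) / Gamma(p/2)."""
    return 2.0 ** r * math.gamma(r + p / 2.0) / math.gamma(p / 2.0)


# Sanity checks against known moments: E[chi2_p] = p and E[chi2_1^2] = 3.
first_moment = chi2_moment(3, 1)
second_moment = chi2_moment(1, 2)
```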

Therefore we shall follow the practical idea from [27] and [18], which allows one to avoid a hard large deviations analysis and to calculate rather sharp thresholds numerically. We assume at this step that the critical values $\mathfrak{z}_1, \ldots, \mathfrak{z}_{K-1}$ are already fixed and satisfy the following set of $K - 1$ inequalities:

Definition 2.1. Propagation conditions (PC)

Let $\hat\theta_k$ denote the last accepted estimate after the first $k$ steps of the procedure:

$$\hat\theta_k := \tilde\theta_{\min\{k, \hat k\}}. \tag{2.16}$$

The critical values $\mathfrak{z}_1, \ldots, \mathfrak{z}_{K-1}$ satisfy

$$\mathbb{E}_{0,\Sigma}|(\tilde\theta_k - \hat\theta_k)^\top B_k (\tilde\theta_k - \hat\theta_k)|^r \le \alpha\,C(p, r) \quad \text{for all } k = 2, \ldots, K, \tag{2.17}$$

where $C(p, r)$ is defined by (2.15), $\alpha \in (0, 1]$ is an additional "free" tuning parameter which can be taken equal to 1, and $\mathbb{E}_{0,\Sigma}$ stands for the expectation w.r.t. the measure $\mathcal{N}(0, \Sigma)$.

Remark 2.1. Lemma 4.1 from Section 4 shows that in the "no bias" situation the Gaussian distribution provides a nice pivotality property: the actual value of the parameter $\theta$ is not important for the risk of the adaptive estimate, so one can put $\theta = 0$ in (2.17).

Remark 2.2. Clearly, at any step $k \le K$ of the algorithm the "current value" of the adaptive estimator $\hat\theta_k$ depends on the thresholds $\mathfrak{z}_1, \ldots, \mathfrak{z}_{k-1}$. The theoretical aspects related to the heteroscedasticity of the model and to the incorrectly known variance are the focus of the present paper. Thus we do not detail the practical aspects of the threshold calibration, only mentioning that in practice this can be done by Monte Carlo simulations under the known "parametric" measure $\mathcal{N}(0, \Sigma)$. Moreover, one needs to calculate them only once. For a detailed consideration of the practical aspects of the calibration as well as for the computational results see [27] or [18], focused on image denoising by local constant fitting. Demo versions of the software are available on the web page http://www.cs.tut.fi/~lasip/.

3 Theoretical study 

In order to control the admissible level of misspecification for the "model" covariance matrix from (1.6) we need to introduce the following condition on the relative variability in errors:

(A4) There exists $\delta \in [0, 1)$ such that

$$1 - \delta \le \sigma_{0,i}^2/\sigma_i^2 \le 1 + \delta \quad \text{for all } i = 1, \ldots, n.$$

3.1 Upper bound for the critical values

Let us at this step recall the notion of the Löwner partial ordering: for any real symmetric matrices $A$ and $B$ we write $A \preceq B$ if and only if $\vartheta^\top A\vartheta \le \vartheta^\top B\vartheta$ for all vectors $\vartheta$, or, equivalently, if and only if the matrix $B - A$ is nonnegative definite. Assuming (A4), the true covariance matrix satisfies $\Sigma_0 \preceq (1 + \delta)\Sigma$, and the variance of the estimate $\tilde\theta_k$ is bounded with $B_k^{-1}$:

$$\begin{aligned}
V_k := \operatorname{Var}\tilde\theta_k &= B_k^{-1}\Psi W_k \Sigma_0 W_k \Psi^\top B_k^{-1} && (3.1)\\
&\preceq (1 + \delta)\,B_k^{-1}\Psi W_k \Sigma W_k \Psi^\top B_k^{-1}\\
&= (1 + \delta)\,B_k^{-1}\Psi\Sigma^{-1/2}\mathcal{W}_k^2\Sigma^{-1/2}\Psi^\top B_k^{-1}\\
&\preceq (1 + \delta)\,B_k^{-1}\Psi\Sigma^{-1/2}\mathcal{W}_k\Sigma^{-1/2}\Psi^\top B_k^{-1}\\
&= (1 + \delta)\,B_k^{-1}\Psi W_k \Psi^\top B_k^{-1}\\
&= (1 + \delta)\,B_k^{-1}. && (3.2)
\end{aligned}$$

The last inequality follows from the observation that all the entries of the "weight" matrix $\mathcal{W}_k$ do not exceed one, implying $\mathcal{W}_k^2 \preceq \mathcal{W}_k$. Equality holds if the $\{w_{k,i}\}$ are boxcar (rectangular) kernels and the noise is known, i.e., if $\delta = 0$. To justify the procedure one needs to show that the critical values chosen by (PC) are finite. This is obtained under the following assumption:

(A5) Let for some constants $u_0$ and $u$ such that $1 < u_0 \le u$, for any $2 \le k \le K$ the matrices $B_k$ satisfy

Remark 3.1. In the "one dimensional case" $p = 1$, that is for the local constant fit, the "matrix" $B_k = \sum_{i=1}^n w_{k,i}\,\sigma_i^{-2}$ is just a weighted "local design size". Assume for simplicity that the weights are rectangular kernels $w_{k,i}(x) = \mathbb{1}\{|X_i - x| \le h_k/2\}$ and the design is equidistant. Then for $n$ sufficiently large

$$B_k = \sum_{i=1}^n \sigma_i^{-2}\,\mathbb{1}\Big\{|X_i - x| \le \frac{h_k}{2}\Big\}$$

and the condition (A5) means that the bandwidths grow geometrically: $h_k = u\,h_{k-1}$.
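
The remark can be illustrated numerically. The Python sketch below assumes, for illustration only, a constant model variance, an equidistant integer design and rectangular kernels; the helper names `geometric_bandwidths` and `local_design_size` are hypothetical.

```python
def geometric_bandwidths(h1, u, K):
    """Bandwidth grid h_1, ..., h_K with h_k = u * h_{k-1}."""
    return [h1 * u ** k for k in range(K)]


def local_design_size(design, x0, h, sigma2=1.0):
    """B_k for p = 1 with rectangular weights 1{|X_i - x0| <= h/2}.

    Assumption: the model variance sigma2 is the same for all observations.
    """
    return sum(1.0 for xi in design if abs(xi - x0) <= h / 2.0) / sigma2


design = list(range(-1000, 1001))            # equidistant design (illustrative)
bandwidths = geometric_bandwidths(10.5, 2.0, 4)
sizes = [local_design_size(design, 0, h) for h in bandwidths]
# Successive ratios B_k / B_{k-1}; they stay close to the growth factor u = 2.
ratios = [b / a for a, b in zip(sizes, sizes[1:])]
```

With a geometric grid the number of scales needed to pass from $h_1$ to $h_K$ is of order $\log(h_K/h_1)$, which is why $K$ is typically of order $\log n$.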
Now we are able to formulate a theorem on finiteness of the critical values. 

Theorem 3.1. The theoretical choice of the critical values

Assume (A1)-(A3) and (A5). The adaptive procedure (2.13) in the considered set-up is well defined in the sense that the choice of the critical values of the form

$$\mathfrak{z}_k = \frac{1}{\mu}\Big\{r(K - k)\log u + \log(K/\alpha) - \frac{p}{4}\log(1 - 4\mu) - \log(1 - u^{-r}) + \bar C(p, r)\Big\} \tag{3.3}$$

provides the conditions (2.17) for all $k \le K$. Here

$$\bar C(p, r) = \log\Big\{\frac{2^{2r}\,[\Gamma(2r + p/2)\,\Gamma(p/2)]^{1/2}}{\Gamma(r + p/2)}\Big\}$$

and $\mu \in (0, 1/4)$. Particularly,

(3.4)

The proof is given in subsection 4.2.



3.2 Quality of estimation in the nearly parametric case: 

The small modeling bias condition and the propagation property 

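The critical values of the procedure $\mathfrak{z}_1, \ldots, \mathfrak{z}_{K-1}$ were selected by the propagation conditions (2.17) under the measure $\mathcal{N}(\theta, \Sigma)$. Now $\theta_1^* \approx \ldots \approx \theta_k^* \approx \theta$ up to some $k \le K$, and the covariance matrix is $\Sigma_0$. The aim is to formalize the meaning of "$\approx$" and to justify the use of the critical values in this situation. For this purpose we will take into account the discrepancy between the joint distributions of the linear estimates $\tilde\theta_1, \ldots, \tilde\theta_k$ for $k = 1, \ldots, K$ under the "no bias" assumption, corresponding to the distributions with the mean $\theta_1^* = \ldots = \theta_k^* = \theta$ and the incorrectly specified covariance matrix $\Sigma$, and in the general situation with $\theta_1^* \ne \ldots \ne \theta_k^*$ and the covariance $\Sigma_0$. Denote the expectations w.r.t. these measures by $\mathbb{E}_{\theta,\Sigma} := \mathbb{E}_{k,\theta,\Sigma}$ and $\mathbb{E}_{f,\Sigma_0} := \mathbb{E}_{k,f,\Sigma_0}$ respectively. Denote the $p \times k$ matrix of the first $k$ estimators $\tilde{\boldsymbol\Theta}_k := (\tilde\theta_1, \ldots, \tilde\theta_k)$ and its expectations correspondingly by

$$\boldsymbol\Theta_k^* := \mathbb{E}_{f,\Sigma_0}\tilde{\boldsymbol\Theta}_k = (\theta_1^*, \ldots, \theta_k^*),$$
$$\boldsymbol\Theta_k := \mathbb{E}_{\theta,\Sigma}\tilde{\boldsymbol\Theta}_k = (\theta, \ldots, \theta).$$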
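Let $A \otimes B$ stand for the Kronecker product of $A$ and $B$ defined as

$$A \otimes B := \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B\\ a_{21}B & a_{22}B & \cdots & a_{2n}B\\ \vdots & \vdots & \ddots & \vdots\\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B \end{pmatrix}.$$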



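Denote the $pk \times pk$ covariance matrices of $\operatorname{vec}\tilde{\boldsymbol\Theta}_k = (\tilde\theta_1^\top, \ldots, \tilde\theta_k^\top)^\top \in \mathbb{R}^{pk}$ by

$$\boldsymbol\Sigma_k := \operatorname{Var}_{\theta,\Sigma}[\operatorname{vec}\tilde{\boldsymbol\Theta}_k] = \mathbf{D}_k(J_k \otimes \Sigma)\mathbf{D}_k^\top, \tag{3.5}$$
$$\boldsymbol\Sigma_{k,0} := \operatorname{Var}_{f,\Sigma_0}[\operatorname{vec}\tilde{\boldsymbol\Theta}_k] = \mathbf{D}_k(J_k \otimes \Sigma_0)\mathbf{D}_k^\top, \tag{3.6}$$

where the matrix $J_k$ is a $k \times k$ matrix with all its elements equal to 1, and the $pk \times nk$ matrix $\mathbf{D}_k$ is defined as follows:

$$\mathbf{D}_k := \operatorname{diag}(D_1, \ldots, D_k), \qquad D_l := B_l^{-1}\Psi W_l, \quad l = 1, \ldots, k. \tag{3.7}$$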



By Lemma 4.8 from Section 4, under Assumption (A4) with the same $\delta$ a similar relation holds for the covariance matrices $\boldsymbol\Sigma_k$ and $\boldsymbol\Sigma_{k,0}$ of the linear estimators:

$$(1 - \delta)\boldsymbol\Sigma_k \preceq \boldsymbol\Sigma_{k,0} \preceq (1 + \delta)\boldsymbol\Sigma_k, \qquad k \le K. \tag{3.8}$$

Although the moment generating function of $\operatorname{vec}\tilde{\boldsymbol\Theta}_k$ has the form corresponding to the multivariate normal distribution, see Lemma 4.10 in Section 4, this representation makes sense only if $\boldsymbol\Sigma_k$ is nonsingular. Notice that $\operatorname{rank}(J_k \otimes \Sigma) = n$. From $J_k \otimes \Sigma \succeq 0$ it follows only that $\boldsymbol\Sigma_k \succeq 0$ and, similarly, $\boldsymbol\Sigma_{k,0} \succeq 0$. However, without any additional assumptions it is easy to show, see Lemma 4.9 in Section 4, that for rectangular kernels $\boldsymbol\Sigma_k \succ 0$. On the other hand, due to (3.8), it is enough to require nonsingularity only for the matrix $\boldsymbol\Sigma_k$ corresponding to the approximate model (1.6), and its choice belongs to a statistician. In what follows we assume that $\boldsymbol\Sigma_k \succ 0$.

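Denote by $\mathbb{P}^k_{\theta,\Sigma} = \mathcal{N}(\operatorname{vec}\boldsymbol\Theta_k, \boldsymbol\Sigma_k)$ and by $\mathbb{P}^k_{f,\Sigma_0} = \mathcal{N}(\operatorname{vec}\boldsymbol\Theta_k^*, \boldsymbol\Sigma_{k,0})$, $k = 1, \ldots, K$, the distributions of $\operatorname{vec}\tilde{\boldsymbol\Theta}_k$ under the null and under the alternative. Denote also the Radon-Nikodym derivative by

$$Z_k := \frac{d\mathbb{P}^k_{f,\Sigma_0}}{d\mathbb{P}^k_{\theta,\Sigma}}. \tag{3.9}$$

Then, by Lemma 4.11 from Section 4, the Kullback-Leibler divergence between these measures has the following form:

$$2\,\mathrm{KL}(\mathbb{P}^k_{f,\Sigma_0}, \mathbb{P}^k_{\theta,\Sigma}) := 2\,\mathbb{E}_{f,\Sigma_0}\log\Big(\frac{d\mathbb{P}^k_{f,\Sigma_0}}{d\mathbb{P}^k_{\theta,\Sigma}}\Big) = \Delta(k) + \log\Big(\frac{\det\boldsymbol\Sigma_k}{\det\boldsymbol\Sigma_{k,0}}\Big) + \operatorname{tr}(\boldsymbol\Sigma_k^{-1}\boldsymbol\Sigma_{k,0}) - pk, \tag{3.10}$$

where

$$b(k) := \operatorname{vec}\boldsymbol\Theta_k^* - \operatorname{vec}\boldsymbol\Theta_k, \tag{3.11}$$
$$\Delta(k) := b(k)^\top\boldsymbol\Sigma_k^{-1}b(k). \tag{3.12}$$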

If there were no "noise misspecification", i.e., if $\delta = 0$, implying $\Sigma = \Sigma_0$, then $\Delta(k) = b(k)^\top\boldsymbol\Sigma_k^{-1}b(k) = 2\,\mathrm{KL}(\mathbb{P}^k_{f,\Sigma}, \mathbb{P}^k_{\theta,\Sigma})$. Therefore, this quantity can be used to indicate the deviation between the mean values in the true (1.1) and the approximate (1.6) models. Clearly, under (A3), the quantity $\Delta(k)$ grows with $k$, so, following the terminology suggested in [27], we introduce the small modeling bias condition:

(SMB) Let for some $k \le K$ and some $\theta$ there exist a constant $\Delta \ge 0$ such that

$$\Delta(k) \le \Delta.$$

Monotonicity of $\Delta(k)$ and Assumption (SMB) immediately imply that

$$\Delta(k') \le \Delta \quad \text{for all } k' \le k.$$

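The conditions (3.8) yield $-pk\delta \le \operatorname{tr}(\boldsymbol\Sigma_k^{-1}\boldsymbol\Sigma_{k,0}) - pk \le pk\delta$. Thus (4.19) implies the bound for the Kullback-Leibler divergence in terms of $\delta$:

$$-\frac{pk}{2}\log(1 + \delta) + \frac{\Delta(k)}{2} - \frac{pk\delta}{2} \le \mathrm{KL}(\mathbb{P}^k_{f,\Sigma_0}, \mathbb{P}^k_{\theta,\Sigma}) \le -\frac{pk}{2}\log(1 - \delta) + \frac{\Delta(k)}{2} + \frac{pk\delta}{2}. \tag{3.13}$$

Moreover, as $\delta \to 0+$,

$$\Delta(k) - 2pk\delta + o(\delta) \le 2\,\mathrm{KL}(\mathbb{P}^k_{f,\Sigma_0}, \mathbb{P}^k_{\theta,\Sigma}) \le \Delta(k) + 2pk\delta + o(\delta). \tag{3.14}$$

This means that, if for some $k$ Assumption (SMB) is fulfilled and $\delta = O(1/K)$, then the Kullback-Leibler divergence between the measures $\mathbb{P}^k_{\theta,\Sigma}$ and $\mathbb{P}^k_{f,\Sigma_0}$ is bounded by a small constant.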
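
The two-sided bound on $2\,\mathrm{KL}$ can be evaluated directly and compared with its linearization $\Delta(k) \pm 2pk\delta$ from (3.14). The Python sketch below is only an illustration: the bounds are the ones obtained by inserting the ordering of the covariance matrices into the determinant and trace terms of (3.10), and the function name and input values are hypothetical.

```python
import math


def two_kl_bounds(modeling_bias, p, k, delta):
    """Lower and upper bounds on 2*KL.

    modeling_bias : the quantity Delta(k)
    p, k          : parameter dimension and number of scales
    delta         : relative error in the variances, 0 <= delta < 1
    """
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must lie in [0, 1)")
    # log-determinant term lies in [-pk*log(1+delta), -pk*log(1-delta)],
    # trace term minus pk lies in [-pk*delta, pk*delta].
    lower = modeling_bias - p * k * math.log(1.0 + delta) - p * k * delta
    upper = modeling_bias - p * k * math.log(1.0 - delta) + p * k * delta
    return lower, upper


lower, upper = two_kl_bounds(modeling_bias=0.5, p=2, k=5, delta=0.01)
# Linearization for small delta: Delta(k) -/+ 2*p*k*delta
linear_lower, linear_upper = 0.5 - 2 * 2 * 5 * 0.01, 0.5 + 2 * 2 * 5 * 0.01
```

For $\delta = 0$ both bounds collapse to $\Delta(k)$, and the width of the interval grows proportionally to $pk\delta$, which is why $\delta$ has to decrease like $1/K$.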
Now one can state the crucial property for obtaining the final oracle result. 

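Theorem 3.2. Propagation property

Assume (A1)-(A5) and (PC). Then for any $k \le K$ the following upper bound holds:

$$\mathbb{E}|(\tilde\theta_k - \hat\theta_k)^\top B_k (\tilde\theta_k - \hat\theta_k)|^{r/2} \le (\alpha\,\mathbb{E}|\chi_p^2|^r)^{1/2}\,(1 + \delta)^{pk/4}(1 - \delta)^{-3pk/4}\exp\Big\{\varphi(\delta)\,\frac{\Delta(k)}{2(1 - \delta)}\Big\},$$

where the factor $\varphi(\delta)$ equals 1 for homogeneous errors. Here $\tilde\theta_k = \tilde\theta_k(x)$ is the QMLE defined by (2.2), and $\hat\theta_k(x) = \tilde\theta_{\min\{k, \hat k\}}(x)$ is the adaptive estimate at the $k$th step of the procedure.

The proof is given in Subsection 4.4.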

Remark 3.2. Bounds (4.28) and (4.27) below give a condition on the relative error in the noise misspecification. As $\delta \to 0+$, for every $k \le K$ it holds that

$$\varphi(\delta)\,\frac{\Delta(k)}{1 + \delta} - 2pk\delta + o(\delta) \le \log\mathbb{E}_{\theta,\Sigma}[Z_k^2] \le \varphi(\delta)\,\frac{\Delta(k)}{1 - \delta} + 2pk\delta + o(\delta),$$

where $Z_k$ is defined by (3.9). This bound implies, up to the additive constant $\log(\alpha\,\mathbb{E}|\chi_p^2|^r)/2$, the same asymptotic behavior for the logarithm of the risk of the adaptive estimate at each step of the procedure. Because by (SMB) the quantity $\Delta(k)$ is supposed to be bounded by a small constant, and $K$ is of order $\log n$, $\mathbb{E}_{\theta,\Sigma}[Z_k^2]$ is small if $\delta = O(1/\log n)$. This means that in the case when $\Sigma$ is an estimate for $\Sigma_0$, only a logarithmic in the sample size quality of estimation is needed. This observation is of particular importance, since it is known from |2S] that over classes of functions with bounded second derivative the rate $n^{-1/2}$ of variance estimation is achievable only for the dimension $d \le 8$.

Remark 3.3. The propagation property ensures that, with high probability, the adaptive procedure does not stop while $\Delta(k)$ is small, i.e., under (SMB), provided that the relative error $\delta$ in the noise is sufficiently small.
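
For a given pair of variance vectors, the smallest $\delta$ compatible with condition (A4) is $\max_i |\sigma_{0,i}^2/\sigma_i^2 - 1|$. The Python sketch below computes it; the function name and the data are illustrative, and since the true variances are unknown in applications, such a check is only meaningful in simulation studies.

```python
def misspecification_level(true_var, model_var):
    """Smallest delta with 1 - delta <= true_var[i] / model_var[i] <= 1 + delta."""
    if len(true_var) != len(model_var):
        raise ValueError("variance vectors must have the same length")
    return max(abs(s0 / s - 1.0) for s0, s in zip(true_var, model_var))


# Illustrative values: the second model variance is off by 25 percent.
delta = misspecification_level(
    true_var=[1.0, 1.875, 0.9],
    model_var=[1.0, 1.5, 1.0],
)
condition_a4_holds = delta < 1.0
```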

3.3 Quality of estimation in the nonparametric case: the oracle result

Define the oracle index as the largest index $k \le K$ such that (SMB) holds:

$$k^* := \max\{k \le K : \Delta(k) \le \Delta\}. \tag{3.15}$$

Theorem 3.3. Let $\Delta(1) \le \Delta$, i.e., the first estimate is always accepted in the testing procedure. Let $k^*$ be the oracle index. Then under the conditions (A1), (A2), (A4), (A3), (A5) the risk between the adaptive estimate and the oracle one is bounded with the following expression:

$$\mathbb{E}|(\tilde\theta_{k^*} - \hat\theta)^\top B_{k^*}(\tilde\theta_{k^*} - \hat\theta)|^{r/2} \le \mathfrak{z}_{k^*}^{r/2} + (\alpha\,\mathbb{E}|\chi_p^2|^r)^{1/2}\,(1 + \delta)^{pk^*/4}(1 - \delta)^{-3pk^*/4}\exp\Big\{\varphi(\delta)\,\frac{\Delta}{2(1 - \delta)}\Big\}, \tag{3.16}$$

where $\varphi(\delta)$ is as in Theorem 3.2.



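Proof. By the definition of the adaptive estimate, $\hat\theta = \tilde\theta_{\hat k}$. Because the events $\{\hat k \le k^*\}$ and $\{\hat k > k^*\}$ are disjoint, one can write

$$\begin{aligned}
\mathbb{E}|(\tilde\theta_{k^*} - \hat\theta)^\top B_{k^*}(\tilde\theta_{k^*} - \hat\theta)|^{r/2} &= \mathbb{E}|(\tilde\theta_{k^*} - \hat\theta)^\top B_{k^*}(\tilde\theta_{k^*} - \hat\theta)|^{r/2}\,\mathbb{1}\{\hat k \le k^*\}\\
&\quad + \mathbb{E}|(\tilde\theta_{k^*} - \hat\theta)^\top B_{k^*}(\tilde\theta_{k^*} - \hat\theta)|^{r/2}\,\mathbb{1}\{\hat k > k^*\}.
\end{aligned}$$

If $\hat k \le k^*$ then $\hat\theta_{k^*} := \tilde\theta_{\min\{k^*, \hat k\}} = \tilde\theta_{\hat k} = \hat\theta$. Thus, to bound the first summand, it is enough to apply Theorem 3.2 with $k = k^*$.

To bound the second expectation, i.e., to bound the fluctuations of the adaptive estimate $\hat\theta$ at the steps of the procedure for which the (SMB) condition is not fulfilled anymore, just notice that for $\hat k > k^*$ the quadratic form coincides with the test statistic $T_{k^*, \hat k}$:

$$(\tilde\theta_{k^*} - \hat\theta)^\top B_{k^*}(\tilde\theta_{k^*} - \hat\theta) = (\tilde\theta_{k^*} - \tilde\theta_{\hat k})^\top B_{k^*}(\tilde\theta_{k^*} - \tilde\theta_{\hat k}) = T_{k^*, \hat k}.$$

But the index $\hat k$ was accepted; this means that $T_{l, \hat k} \le \mathfrak{z}_l$ for all $l < \hat k$ and therefore for $l = k^*$. Thus

$$\mathbb{E}|(\tilde\theta_{k^*} - \hat\theta)^\top B_{k^*}(\tilde\theta_{k^*} - \hat\theta)|^{r/2}\,\mathbb{1}\{\hat k > k^*\} \le \mathfrak{z}_{k^*}^{r/2}.$$

□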



3.4 Componentwise oracle risk bounds 

Theorem 3.3 provides the oracle risk bound for the adaptive estimator $\hat\theta(x) = \tilde\theta_{\hat k}(x)$ of the parameter vector $\theta(x) \in \mathbb{R}^p$ corresponding to the estimator $f_{\hat\theta}(x)$ of the type (1.5). It is interesting to look at the oracle quality of estimation of the components $\theta^{(1)}, \ldots, \theta^{(p)}$ of the vector $\theta$, having in mind that the choice of the polynomial basis leads to the direct estimation of the value of the regression function and its derivatives by the coordinates of $\theta$.

Denote by $LP_k(p - 1)$ a local polynomial estimator of order $p - 1$ corresponding to the $k$th degree of localization, and, respectively, by $LP^{ad}(p - 1)$ its adaptive counterpart, i.e., $LP^{ad}(p - 1) := LP_{\hat k}(p - 1)$. If the basis is polynomial and the regression function $f(\cdot)$ is sufficiently smooth in a neighborhood of $x$, then $\hat\theta(x)$ is the $LP^{ad}(p - 1)$ estimator of the vector $(f^{(0)}(x), \ldots, f^{(p-1)}(x))^\top$ of the values of the function $f$ and its derivatives at the reference point $x \in \mathbb{R}^d$.

Now we are going to obtain an oracle result, similar to that of the previous section, for the components of the vector $\theta(x)$, particularly for $e_j^\top\theta(x)$, $j = 1, \ldots, p$, where $e_j = (0, \ldots, 1, \ldots, 0)^\top$ is the $j$th canonical basis vector in $\mathbb{R}^p$. As a corollary of this general result, in the case of the polynomial basis we get an oracle risk bound for the $LP^{ad}(p - 1)$ estimators of the function $f$ and its derivatives at the point $x$.

Let the $LP_k(p - 1)$ estimator of $f^{(j-1)}(x)$ be given by

$$\tilde f_k^{(j-1)}(x) := e_j^\top\tilde\theta_k(x), \qquad j = 1, \ldots, p, \tag{3.17}$$
$$\tilde f_k(x) := \tilde f_k^{(0)}(x) = e_1^\top\tilde\theta_k(x).$$

Then the adaptive local polynomial estimators are defined as follows:

$$\hat f^{(j-1)}(x) := e_j^\top\hat\theta(x), \qquad j = 1, \ldots, p, \tag{3.18}$$
$$\hat f(x) := e_1^\top\hat\theta(x).$$

Similarly, the adaptive estimators of the function $f$ and its derivatives corresponding to the $k$th step of the procedure are given by

$$\hat f_k^{(j-1)}(x) := e_j^\top\hat\theta_k(x), \qquad j = 1, \ldots, p. \tag{3.19}$$

Thus, if the basis is polynomial, the estimator $\hat f(x) = \hat f^{(0)}(x)$ is the $LP^{ad}(p - 1)$ estimator of the value $f(x)$, and $\hat f^{(j-1)}(x)$ with $j = 2, \ldots, p$ are, correspondingly, the $LP^{ad}(p - 1)$ estimators of the values of its derivatives. However, it should be stressed that the results of Theorems 3.3 and 3.9 hold for any basis satisfying the conditions of the theorems. We shall need the following assumptions:

(A6) There exist $0 < \sigma_{\min}(k) \le \sigma_{\max}(k) < \infty$ such that for $i : X_i \in U_{h_k}(x)$, with $U_{h_k}(x)$ given by $\mathcal{W}_k$, the variances of errors from the parametric (known) model (1.6) are locally uniformly bounded:

$$\sigma_{\min}^2(k) \le \sigma_i^2 \le \sigma_{\max}^2(k).$$

(A7) Let assumption (A6) be satisfied. There exists a number $\Lambda_0 > 0$ such that for any $k = 1, \ldots, K$ the smallest eigenvalue satisfies $\lambda_p(B_k) \ge n h_k^d\,\Lambda_0\,\sigma_{\max}^{-2}(k)$ for $n$ sufficiently large.

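Then, because $B_k^{-1} \preceq \lambda_p(B_k)^{-1} I_p$, for any $k = 1, \ldots, K$ and for any $\gamma \in \mathbb{R}^p$ we have

$$\gamma^\top B_k^{-1}\gamma \le \frac{\sigma_{\max}^2(k)}{n h_k^d\,\Lambda_0}\,\|\gamma\|^2 \le \frac{\bar\sigma_{\max}^2(k)}{n h_k^d\,\Lambda_0}\,\|\gamma\|^2, \tag{3.20}$$

where $\bar\sigma_{\max}^2(k) := \max_{1 \le l \le k}\sigma_{\max}^2(l)$. Thus we have the following lemma:

Lemma 3.4. Let (A6) and (A7) be satisfied. Then for any $j = 1, \ldots, p$ and $k, k' = 1, \ldots, K$ the following upper bound holds:

$$\Big(\frac{n h_k^d\,\Lambda_0}{\bar\sigma_{\max}^2(k)}\Big)^{1/2}|e_j^\top\tilde\theta_k - e_j^\top\tilde\theta_{k'}| \le \|B_k^{1/2}(\tilde\theta_k - \tilde\theta_{k'})\|.$$

Proof. By (3.20), taking $\gamma = B_k^{1/2}(\tilde\theta_k - \tilde\theta_{k'})$, we have

$$|e_j^\top(\tilde\theta_k - \tilde\theta_{k'})|^2 \le \|\tilde\theta_k - \tilde\theta_{k'}\|^2 = \gamma^\top B_k^{-1}\gamma \le \frac{\bar\sigma_{\max}^2(k)}{n h_k^d\,\Lambda_0}\,\|B_k^{1/2}(\tilde\theta_k - \tilde\theta_{k'})\|^2.$$

□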

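To obtain the "componentwise" oracle risk bounds we need to recheck the "propagation property". Firstly, notice that the "propagation conditions" (2.17) on the choice of the critical values $\mathfrak{z}_1, \ldots, \mathfrak{z}_{K-1}$ imply similar bounds for the components $e_j^\top\hat\theta_k(x)$. Recall that $\hat\theta_k := \tilde\theta_{\min\{k, \hat k\}}$. Then, by (2.17), Lemma 3.4 and the pivotality property from Lemma 4.1, we have the following simple observation:

Lemma 3.5. Under the propagation conditions (PC), for any $\theta \in \Theta$ and all $k = 2, \ldots, K$ we have:

$$\Big(\frac{n h_k^d\,\Lambda_0}{\bar\sigma_{\max}^2(k)}\Big)^{r}\mathbb{E}_{\theta,\Sigma}|e_j^\top\tilde\theta_k(x) - e_j^\top\hat\theta_k(x)|^{2r} \le \mathbb{E}_{0,\Sigma}\|B_k^{1/2}(\tilde\theta_k - \hat\theta_k)\|^{2r} \le \alpha\,C(p, r).$$

Here $\mathbb{E}_{0,\Sigma}$ stands for the expectation w.r.t. $\mathcal{N}(0, \Sigma)$ and $C(p, r)$ is given by (4.8).

As before we suppress the dependence on $x$. To get the propagation property we study for $k = 1, \ldots, K$ the joint distributions of $e_j^\top\tilde\theta_1, \ldots, e_j^\top\tilde\theta_k$, that is the distribution of $e_j^\top\tilde{\boldsymbol\Theta}_k$, the $j$th row of the matrix $\tilde{\boldsymbol\Theta}_k$. Obviously,

$$\mathbb{E}_{f,\Sigma_0}[e_j^\top\tilde{\boldsymbol\Theta}_k] = e_j^\top\boldsymbol\Theta_k^* = (e_j^\top\theta_1^*, \ldots, e_j^\top\theta_k^*),$$
$$\mathbb{E}_{\theta,\Sigma}[e_j^\top\tilde{\boldsymbol\Theta}_k] = e_j^\top\boldsymbol\Theta_k = (e_j^\top\theta, \ldots, e_j^\top\theta).$$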

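Recall that the matrices $\boldsymbol\Sigma_{k,0}$ and $\boldsymbol\Sigma_k$ have a block structure. Now, for instance, to study the estimator of the first coordinate of the vector $\theta = \theta(x)$, or of $f(x)$ in the case of the polynomial basis, we take the first elements of each block, and so on. Denote the $k \times k$ covariance matrices of the $j$th elements of the vectors $\tilde\theta_1, \ldots, \tilde\theta_k$ by

$$\boldsymbol\Sigma_{k,j} := \mathbf{D}_{k,j}(J_k \otimes \Sigma)\mathbf{D}_{k,j}^\top, \tag{3.21}$$
$$\boldsymbol\Sigma_{k,0,j} := \mathbf{D}_{k,j}(J_k \otimes \Sigma_0)\mathbf{D}_{k,j}^\top, \tag{3.22}$$

where $J_k$ is a $k \times k$ matrix with all its elements equal to 1, and the $k \times nk$ block diagonal matrices $\mathbf{D}_{k,j}$ are defined by

$$\mathbf{D}_{k,j} := e_j^\top D_1 \oplus \cdots \oplus e_j^\top D_k = (I_k \otimes e_j)^\top\mathbf{D}_k, \qquad D_l = B_l^{-1}\Psi W_l, \quad l = 1, \ldots, k. \tag{3.23}$$

Moreover, the following representation holds:

$$\boldsymbol\Sigma_{k,j} = (I_k \otimes e_j)^\top\mathbf{D}_k(J_k \otimes \Sigma)\mathbf{D}_k^\top(I_k \otimes e_j) = (I_k \otimes e_j)^\top\boldsymbol\Sigma_k(I_k \otimes e_j), \tag{3.24}$$

where $\boldsymbol\Sigma_k$ is defined by (3.5). Similarly,

$$\boldsymbol\Sigma_{k,0,j} = (I_k \otimes e_j)^\top\boldsymbol\Sigma_{k,0}(I_k \otimes e_j). \tag{3.25}$$

Thus, the important relation (3.8) is preserved for $\boldsymbol\Sigma_{k,j}$ and $\boldsymbol\Sigma_{k,0,j}$ obtained by picking up the $(j, j)$th elements of each block of $\boldsymbol\Sigma_k$ and $\boldsymbol\Sigma_{k,0}$ respectively.

With the usual notation $\gamma^{(j)}$ for the $j$th component of $\gamma \in \mathbb{R}^p$, denote

$$b_j(k) := (e_j^\top(\theta_1^* - \theta), \ldots, e_j^\top(\theta_k^* - \theta))^\top = ((\theta_1^* - \theta)^{(j)}, \ldots, (\theta_k^* - \theta)^{(j)})^\top \in \mathbb{R}^k, \tag{3.26}$$
$$\Delta_j(k) := b_j(k)^\top\boldsymbol\Sigma_{k,j}^{-1}b_j(k). \tag{3.27}$$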

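Theorem 3.6. "Componentwise" propagation property

Under the conditions (A1)-(A7) and (PC), for any $k \le K$ the following upper bound holds:

$$\Big(\frac{n h_k^d\,\Lambda_0}{\bar\sigma_{\max}^2(k)}\Big)^{r/2}\mathbb{E}|e_j^\top\tilde\theta_k(x) - e_j^\top\hat\theta_k(x)|^r \le (\alpha\,\mathbb{E}|\chi_p^2|^r)^{1/2}\,(1 + \delta)^{pk/4}(1 - \delta)^{-3pk/4}\exp\Big\{\varphi(\delta)\,\frac{\Delta_j(k)}{2(1 - \delta)}\Big\} \tag{3.28}$$

with $\varphi(\delta)$ as in Theorem 3.2.

Corollary 3.7. Let the basis be polynomial. Then under the conditions of the preceding theorem $\mathbb{E}|\tilde f_k^{(j-1)}(x) - \hat f_k^{(j-1)}(x)|^r$ satisfies (3.28).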



Proof. The proof essentially follows the line of the proof of Theorem 3.2. If the distribution of $\operatorname{vec}\tilde{\boldsymbol\Theta}_k$ is Gaussian, then any subvector is also Gaussian.

Denote by $\mathbb{P}^{k,j}_{\theta,\Sigma} = \mathcal{N}((e_j^\top\theta, \ldots, e_j^\top\theta)^\top, \boldsymbol\Sigma_{k,j})$ and by $\mathbb{P}^{k,j}_{f,\Sigma_0} = \mathcal{N}((e_j^\top\theta_1^*, \ldots, e_j^\top\theta_k^*)^\top, \boldsymbol\Sigma_{k,0,j})$, $k = 1, \ldots, K$, the distributions of $e_j^\top\tilde{\boldsymbol\Theta}_k$ under the null and under the alternative. By the Cauchy-Schwarz inequality and Lemma 3.5, the risk is bounded via the second moment, under $\mathbb{P}^{k,j}_{\theta,\Sigma}$, of the Radon-Nikodym derivative $Z_{k,j} = d\mathbb{P}^{k,j}_{f,\Sigma_0}/d\mathbb{P}^{k,j}_{\theta,\Sigma}$. By (3.24) and (3.25) the analog of Condition (A4) is preserved for $\boldsymbol\Sigma_{k,0,j}$ and $\boldsymbol\Sigma_{k,j}$, that is, there exists $\delta \in [0, 1)$ such that

$$(1 - \delta)\boldsymbol\Sigma_{k,j} \preceq \boldsymbol\Sigma_{k,0,j} \preceq (1 + \delta)\boldsymbol\Sigma_{k,j} \tag{3.29}$$

for any $k \le K$ and $j = 1, \ldots, p$. Then the assertion of the theorem follows by the Taylor expansion at the point $(e_j^\top\theta, \ldots, e_j^\top\theta)^\top$ and (3.29), similarly to the proof of Theorem 3.2. □

At this point we introduce the "componentwise" small modeling bias conditions:

(SMB$_j$) Let for some $j = 1, \ldots, p$, some $k(j) \le K$, and some $\theta^{(j)} = e_j^\top\theta$ there exist a constant $\Delta_j \ge 0$ such that

$$\Delta_j(k(j)) \le \Delta_j, \tag{3.30}$$

where $\Delta_j(k)$ is defined by (3.27).

Definition 3.8. For each j = 1, . . . ,p the oracle index k*{j) is defined as the largest index in the 
scale for which the {SMBj) condition holds, that is 

k*{j) = max{fc < K : Aj{k) < Aj}. (3.31) 

Theorem 3.9. Assume (Al) — {A7) and (PC) . Let the smallest bandwidth hi be such that the 
first estimate ej9i{x) be always accepted in the adaptive procedure. Let k*{j) be the oracle index 
defined by p.3ip . j = 1, . . . ,p . Then the risk between the j th coordinates of the adaptive estimate 
and the oracle one is bounded with the following expression: 

/nhtr.AoV^'^ 

V^fm] IE|4^^*0)(^)-e7^(-)r (3.32) 

r/2 



+ (aE|x^r)^/^(l + ^)'''=^-/^(l - ^)-^^'=^-/^ exp |^(<5)^^ 



< 

where $\varphi(\delta)$ is as in Theorem 3.

Corollary 3.10. Let the basis be polynomial. Then under the conditions of the preceding theorem the risk between the adaptive estimate and the oracle one, $\mathbb{E}|\hat f^{(j-1)}(x) - \tilde f^{(j-1)}_{k^*(j)}(x)|^r$, satisfies (3.32).

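Proof. To simplify the notation we suppress the dependence on $j$ in the index $k$. Similarly to the proof of Theorem 3.3 we consider the disjoint events $\{\hat k \le k^*\}$ and $\{\hat k > k^*\}$. Therefore,

$$\mathbb{E}|e_j^\top\tilde\theta_{k^*}(x) - e_j^\top\hat\theta(x)|^r = \mathbb{E}|e_j^\top\tilde\theta_{k^*}(x) - e_j^\top\hat\theta(x)|^r\,\mathbb{I}\{\hat k \le k^*\} + \mathbb{E}|e_j^\top\tilde\theta_{k^*}(x) - e_j^\top\hat\theta(x)|^r\,\mathbb{I}\{\hat k > k^*\}.$$

By Lemma 3.4 and the definition of the test statistics, the second summand can be easily bounded: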



E|e76>fe. {x) - e]e{x)\'' I{fc > fc*} 



< E\\Bli\ek'{x) -d{x)wi{k > k*} 

< {''^ 

- Ok' ■ 

To bound the first summand we use the "componentwise" analog of Theorem 3.2, namely Theorem 3.6, and this completes the proof. □
3.5 SMB and the bias-variance trade-off

In [27] it was shown that the small modeling bias $(SMB_1)$ condition (3.30) can be obtained from the "bias-variance trade-off" relations. Notice that our set-up includes the set-up from [27] as a particular case. To prove that a similar relation holds in the present case we need the following definition. Given a point $x$ and the method of localization $w$, for any $j = 1,\dots,p$ the "ideal adaptive bandwidth", see [22], [23], is defined as follows:

$$k^*(j) = \max\{k \le K : \bar b_{k,f^{(j-1)}}(x) \le C_j(w)\,\sigma_k(x)\sqrt{d(n)}\}, \tag{3.33}$$

where $C_j(w)$ is a constant depending on the choice of the smoother $w$,

$$\bar b_{k,f^{(j-1)}}(x) = \sup_{1 \le l \le k}|e_j^\top\theta_l^*(x) - f^{(j-1)}(x)|, \qquad \sigma_k^2(x) = \operatorname{Var}_{f,\Sigma_0}[e_j^\top\tilde\theta_k(x)], \qquad d(n) = \log(h_K/h_1),$$

and $f^{(0)}$ stands for the function $f$ itself. To bound the "modeling bias" $\Delta_j(k)$ we need the following assumption:
following assumption: 

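(A8) There exists a constant $s_j > 0$ such that for all $k \le K$

$$\Sigma_{k,j}^{-1} \preceq s_j\,\Sigma_{k,j,\mathrm{diag}}^{-1}, \tag{3.34}$$

where $\Sigma_{k,j,\mathrm{diag}} = \operatorname{diag}\big(\operatorname{Var}_{\theta,\Sigma}[e_j^\top\tilde\theta_1(x)],\dots,\operatorname{Var}_{\theta,\Sigma}[e_j^\top\tilde\theta_k(x)]\big)$ is a diagonal matrix composed of the diagonal elements of $\Sigma_{k,j}$. Thus we have the following result: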

Theorem 3.11. Assume (A4), (A5) and (A8). Let the weights $\{w_{k,i}(x)\}$ satisfy (4.15). Then for any given point $x$, smoothing function $w$, and $j = 1,\dots,p$, the choice of $k(j) = k^*(j)$ defined by the relation (3.33) with $d(n) = 1$ implies the $(SMB_j)$ condition $\Delta_j(k(j)) \le \Delta_j$ with the constant $\Delta_j = s_j C_j(w)(1+\delta)(1-u_0^{-1})^{-1}$.

Proof. Consider the quantity $b_j(k)^\top\Sigma_{k,j,\mathrm{diag}}^{-1}b_j(k)$. Suppose that $e_j^\top\theta(x) = f^{(j-1)}(x)$. In view of relation (4.15) for the weights $\{w_{l,i}(x)\}$ the form of the matrix $\Sigma_{k,j,\mathrm{diag}}$ is particularly simple:

$$\Sigma_{k,j,\mathrm{diag}} = \operatorname{diag}\big(e_j^\top B_1^{-1}e_j,\dots,e_j^\top B_k^{-1}e_j\big).$$

Then by (A5) and



1=1 

k 

2 ■ 



1 = 1 S 

(^fc./(j-i)(x))^ -(fe-0 
^fe ^3 1=1 



< 



{h.fU-^->{x)) jl + S) 
By (3.33) with $d(n) = 1$, the choice of $k = k^*(j)$ implies $(\bar b_{k,f^{(j-1)}}(x))^2 \le C_j^2(w)\,\sigma_k^2(x)$. Thus
and 

□ 

Remark 3.4. Using the standard technique it is easy to derive from the above result that for
estimation of functions over Hölder classes the methodology proposed in [TB] and [17] and generalized
in the present paper delivers the minimax rate of convergence up to a logarithmic factor.



4 Appendix 

4.1 Pivotality and local parametric risk bounds 

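Lemma 4.1. Pivotality property

Let (A3) hold. Let $\theta_1^* = \dots = \theta_\kappa^* = \theta$ for $\kappa \le K$. Then for any $k \le \kappa$ the risk associated with the adaptive estimate at every step of the procedure does not depend on the parameter $\theta$:

$$\mathbb{E}_\theta|(\tilde\theta_k - \hat\theta_k)^\top B_k(\tilde\theta_k - \hat\theta_k)|^r = \mathbb{E}_0|(\tilde\theta_k - \hat\theta_k)^\top B_k(\tilde\theta_k - \hat\theta_k)|^r,$$

where $\mathbb{E}_0$ denotes the expectation w.r.t. the centered measure $\mathcal{N}(0,\Sigma)$ or $\mathcal{N}(0,\Sigma_0)$.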

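Proof. After the first $k$ steps $\hat\theta_k$ coincides with one of $\tilde\theta_m$, $m \le k$, and this event takes place if for some $l \le m$ the statistic $T_{l,m+1} > \mathfrak{z}_l$. In view of the decomposition (2.8) it holds

$$\{T_{l,m+1} > \mathfrak{z}_l \text{ for some } l = 1,\dots,m \mid H_{m+1}\} = \{(\tilde\theta_l - \tilde\theta_{m+1})^\top B_l(\tilde\theta_l - \tilde\theta_{m+1}) > \mathfrak{z}_l \text{ for some } l = 1,\dots,m\}$$

$$= \Big\{\big\|B_l^{1/2}\big(B_l^{-1}\Psi W_l\Sigma_0^{1/2}\varepsilon - B_{m+1}^{-1}\Psi W_{m+1}\Sigma_0^{1/2}\varepsilon\big)\big\|^2 > \mathfrak{z}_l \text{ for some } l \le m\Big\}$$

with $\varepsilon \sim \mathcal{N}(0, I_n)$. The probability of this event does not depend on the shift $\theta$, so without loss of generality $\theta$ can be taken equal to zero. The risk associated with $\hat\theta_k$ admits the following decomposition:

$$\mathbb{E}_\theta|(\tilde\theta_k - \hat\theta_k)^\top B_k(\tilde\theta_k - \hat\theta_k)|^r = \sum_{m=1}^{k-1}\mathbb{E}_\theta|(\tilde\theta_k - \tilde\theta_m)^\top B_k(\tilde\theta_k - \tilde\theta_m)|^r\,\mathbb{I}\{\hat\theta_k = \tilde\theta_m\}.$$

Under the conditions of the lemma for all $m < k$ the joint distribution of $(\tilde\theta_k - \tilde\theta_m)^\top B_k(\tilde\theta_k - \tilde\theta_m)$ does not depend on $\theta$ by the same argumentation. □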

To justify the statistical properties of the considered procedure we need the following simple
observation. Let for any $\theta, \theta' \in \Theta$ the corresponding log-likelihood ratio $L(W_k,\theta,\theta')$ be defined
by (EH]). Then 

$$2L(W_k,\theta,\theta') = \|W_k^{1/2}(Y - \Psi^\top\theta')\|^2 - \|W_k^{1/2}(Y - \Psi^\top\theta)\|^2.$$
Theorem 4.2. Quadratic shape of the fitted log-likelihood

Let for every $k = 1,\dots,K$ the fitted log-likelihood (FLL) be defined as follows:

$$L(W_k,\tilde\theta_k,\theta') := \max_{\theta\in\Theta} L(W_k,\theta,\theta').$$

Then

$$2L(W_k,\tilde\theta_k,\theta) = (\tilde\theta_k - \theta)^\top B_k(\tilde\theta_k - \theta). \tag{4.1}$$

Proof. Notice that $L(W_k,\theta)$ defined by (2.3) is quadratic in $\theta$. The assertion follows from the
Taylor expansion of the second order at the point $\tilde\theta_k$, because it is the point of maximum, and the
second derivative is a constant matrix. □
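The identity (4.1) can be checked numerically in the simplest setting. The sketch below assumes a local constant fit ($p = 1$, $\Psi$ a row of ones) with unit noise variance, so that $B_k = \sum_i w_{k,i}$ and $\tilde\theta_k$ is the weighted mean; the data, the weights and the function names are hypothetical.

```python
def weighted_loglik(y, w, theta):
    """Local log-likelihood up to an additive constant: -0.5 * sum_i w_i (y_i - theta)^2."""
    return -0.5 * sum(wi * (yi - theta) ** 2 for yi, wi in zip(y, w))


def local_constant_fit(y, w):
    """Return the weighted mean (the local estimate) and B = sum of the weights."""
    b = sum(w)
    theta_tilde = sum(wi * yi for yi, wi in zip(y, w)) / b
    return theta_tilde, b


# Hypothetical observations and localizing weights (illustration only).
y = [1.0, 2.5, 0.5, 3.0]
w = [1.0, 0.5, 0.25, 0.0]

theta_tilde, b = local_constant_fit(y, w)
theta = 0.7  # arbitrary reference parameter value

# Left-hand side of (4.1): twice the fitted log-likelihood ratio.
fll = 2.0 * (weighted_loglik(y, w, theta_tilde) - weighted_loglik(y, w, theta))
# Right-hand side of (4.1): the quadratic form, here the scalar B * (theta_tilde - theta)^2.
quad = b * (theta_tilde - theta) ** 2
```

Both quantities agree up to floating-point error, which reflects that the weighted least-squares criterion is exactly quadratic in $\theta$.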

Let the matrix $S$ be defined as follows:

$$S := \Sigma_0^{1/2}W_k\Psi^\top B_k^{-1}\Psi W_k\Sigma_0^{1/2}. \tag{4.2}$$

Then for the distribution of $L(W_k,\tilde\theta_k,\theta_k^*)$ one observes the so-called "Wilks phenomenon", see [7],
described by the following theorem:

Theorem 4.3. Let the regression model be given by (1.1) and the parameter maximizing the expected local log-likelihood $\theta_k^* = \theta_k^*(x)$ be defined by (2.6). Then for any $k = 1,\dots,K$ the following equality in distribution takes place:

$$2L(W_k,\tilde\theta_k,\theta_k^*) \stackrel{d}{=} \lambda_1(S)\bar\varepsilon_1^2 + \dots + \lambda_p(S)\bar\varepsilon_p^2 \tag{4.3}$$

with $p = \operatorname{rank}(B_k) = \dim\Theta$. Here $\lambda_1(S),\dots,\lambda_p(S)$ are the non-zero eigenvalues of the matrix $S$, and $\bar\varepsilon_i$ are independent standard normal random variables.

Moreover, under (A4) the maximal eigenvalue satisfies $\lambda_{\max}(S) \le 1+\delta$, and for any $\mathfrak{z} > 0$

$$\mathbb{P}\{2L(W_k,\tilde\theta_k,\theta_k^*) > \mathfrak{z}\} \le \mathbb{P}\{\eta > \mathfrak{z}/(1+\delta)\}, \tag{4.4}$$

where $\eta$ is a random variable distributed according to the $\chi^2$ law with $p$ degrees of freedom.

Remark 4.1. Generally, if the matrix $B_k$ is degenerate, in (4.3) the number of terms $p < \dim\Theta$.
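Proof. By Theorem 4.2 and the decomposition (2.8) it holds that:

$$2L(W_k,\tilde\theta_k,\theta_k^*) = (\tilde\theta_k - \theta_k^*)^\top B_k(\tilde\theta_k - \theta_k^*) = (B_k^{-1}\Psi W_k\Sigma_0^{1/2}\varepsilon)^\top B_k(B_k^{-1}\Psi W_k\Sigma_0^{1/2}\varepsilon) = \varepsilon^\top S\varepsilon,$$

where the symmetric matrix $S$ is defined by (4.2). Then by the Schur theorem there exist an
orthogonal matrix $M$ and the diagonal matrix $\Lambda$ composed of the eigenvalues of $S$ such that
$S = M^\top\Lambda M$. For $\varepsilon \sim \mathcal{N}(0, I_n)$ and an orthogonal matrix $M$ it holds that $\bar\varepsilon := M\varepsilon \sim \mathcal{N}(0, I_n)$.
Indeed, $\mathbb{E}M\varepsilon = M\mathbb{E}\varepsilon = 0$ and

$$\operatorname{Var}M\varepsilon = \mathbb{E}M\varepsilon(M\varepsilon)^\top = M\,\mathbb{E}(\varepsilon\varepsilon^\top)M^\top = MM^\top = I_n.$$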
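Therefore,

$$2L(W_k,\tilde\theta_k,\theta_k^*) = \varepsilon^\top M^\top\Lambda M\varepsilon = \bar\varepsilon^\top\Lambda\bar\varepsilon.$$

On the other hand, the matrix $S = \Sigma_0^{1/2}W_k\Psi^\top B_k^{-1}\Psi W_k\Sigma_0^{1/2}$ can be rewritten as:

$$S = \Sigma_0^{1/2}W_k^{1/2}\,\Pi_k\,W_k^{1/2}\Sigma_0^{1/2}$$

with $\Pi_k = W_k^{1/2}\Psi^\top B_k^{-1}\Psi W_k^{1/2}$. Notice that $\Pi_k$ is an orthogonal projector on the linear subspace of dimension $p = \operatorname{rank}(B_k)$ spanned by the rows of the matrix $\Psi W_k^{1/2}$. Indeed, $\Pi_k$ is symmetric and idempotent, i.e., $\Pi_k^2 = \Pi_k$.

Moreover, $\operatorname{rank}(\Pi_k) = \operatorname{tr}(\Pi_k) = \operatorname{tr}(W_k^{1/2}\Psi^\top B_k^{-1}\Psi W_k^{1/2}) = \operatorname{tr}(B_k^{-1}\Psi W_k\Psi^\top) = \operatorname{tr}(B_k^{-1}B_k) = \operatorname{tr}(I_p) = p$. Therefore $\Pi_k$ has only $p$ unit eigenvalues and $n-p$ zero ones. Notice also that the $n\times n$ matrix $S$ has $\operatorname{rank}(S) = \operatorname{rank}(\Pi_kW_k^{1/2}\Sigma_0^{1/2}) = \operatorname{rank}(\Pi_k) = p$ as well. Thus $2L(W_k,\tilde\theta_k,\theta_k^*) = \lambda_1(S)\bar\varepsilon_1^2 + \dots + \lambda_p(S)\bar\varepsilon_p^2$, where $\lambda_1(S),\dots,\lambda_p(S)$ are the non-zero eigenvalues of the matrix $S$.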
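Recall the definition of the matrix norm induced by the $L_2$ vector norm:

$$\|A\| = \sqrt{\lambda_{\max}(A^\top A)}. \tag{4.5}$$

Thus, taking into account Assumption (A4), the induced $L_2$-norm of the matrix $S$ can be estimated as follows:

$$\|S\| = \|\Sigma_0^{1/2}W_k^{1/2}\Pi_kW_k^{1/2}\Sigma_0^{1/2}\| \le \max_i\Big\{\frac{\sigma_{0,i}^2}{\sigma_i^2}\,w_{k,i}\Big\} \le (1+\delta)\max_i\{w_{k,i}\} \le 1+\delta.$$

Therefore, the largest eigenvalue of the matrix $S$ is bounded: $\lambda_{\max}(S) \le 1+\delta$.
The last assertion of the theorem follows from the simple observation that

$$\mathbb{P}\{\lambda_1(S)\bar\varepsilon_1^2 + \dots + \lambda_p(S)\bar\varepsilon_p^2 > \mathfrak{z}\} \le \mathbb{P}\{\lambda_{\max}(S)(\bar\varepsilon_1^2 + \dots + \bar\varepsilon_p^2) > \mathfrak{z}\}.$$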

□ 
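The projector property used in the proof can be verified on a small example. The following sketch assumes a local linear fit ($p = 2$) with unit noise variances, so that $W_k = \operatorname{diag}(w_{k,i})$; the design points, the weights and the helper names (`matmul`, `transpose`, `inv2`) are hypothetical and chosen only to keep the example free of external dependencies.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical centered design points X_i - x and localizing weights w_{k,i}.
design = [-0.3, -0.1, 0.0, 0.2, 0.4]
weights = [0.5, 1.0, 1.0, 1.0, 0.25]
n = len(design)

psi = [[1.0] * n, list(design)]  # p x n basis matrix, p = 2
# Psi W^{1/2}: column i is scaled by sqrt(w_i).
psi_sqrt_w = [[v * math.sqrt(w) for v, w in zip(row, weights)] for row in psi]

b_k = matmul(psi_sqrt_w, transpose(psi_sqrt_w))  # B_k = Psi W Psi^T
pi_k = matmul(matmul(transpose(psi_sqrt_w), inv2(b_k)), psi_sqrt_w)  # n x n projector

pi_sq = matmul(pi_k, pi_k)
trace = sum(pi_k[i][i] for i in range(n))
max_idempotency_gap = max(abs(pi_sq[i][j] - pi_k[i][j])
                          for i in range(n) for j in range(n))
```

Here `trace` equals $p = 2$ and `max_idempotency_gap` vanishes up to rounding, in line with $\operatorname{tr}(\Pi_k) = p$ and $\Pi_k^2 = \Pi_k$.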

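Corollary 4.4. Quasi-parametric risk bounds

Let the model be given by (1.1) and $\theta_k^* = \theta_k^*(x)$ be defined by (2.6). Assume (A4). Then for any $\mu < 1/(1+\delta)$

$$\mathbb{E}\exp\{\mu L(W_k,\tilde\theta_k,\theta_k^*)\} \le [1 - \mu(1+\delta)]^{-p/2}, \tag{4.6}$$

$$\mathbb{E}|2L(W_k,\tilde\theta_k,\theta_k^*)|^r \le (1+\delta)^r C(p,r), \tag{4.7}$$

where

$$C(p,r) = \mathbb{E}|\chi_p^2|^r = 2^r\,\Gamma(r+p/2)/\Gamma(p/2). \tag{4.8}$$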
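Proof. By (4.3) and the independence of $\bar\varepsilon_i$

$$\mathbb{E}\exp\{\mu L(W_k,\tilde\theta_k,\theta_k^*)\} = \mathbb{E}\exp\Big\{\frac{\mu}{2}\sum_{i=1}^p\lambda_i(S)\bar\varepsilon_i^2\Big\} = \prod_{i=1}^p\mathbb{E}\exp\Big\{\frac{\mu}{2}\lambda_i(S)\bar\varepsilon_i^2\Big\} = \prod_{i=1}^p[1 - \mu\lambda_i(S)]^{-1/2} \le [1 - \mu\lambda_{\max}(S)]^{-p/2}.$$

Let $\eta \sim \chi_p^2$. Integrating by parts yields the second inequality:

$$\mathbb{E}|2L(W_k,\tilde\theta_k,\theta_k^*)|^r = r\int_0^\infty\mathbb{P}\{2L(W_k,\tilde\theta_k,\theta_k^*) > \mathfrak{z}\}\,\mathfrak{z}^{r-1}d\mathfrak{z} \le r\int_0^\infty\mathbb{P}\{\eta > \mathfrak{z}/(1+\delta)\}\,\mathfrak{z}^{r-1}d\mathfrak{z} = (1+\delta)^r\,\mathbb{E}|\eta|^r.$$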



□ 
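The constant $C(p,r)$ from (4.8) and the bound (4.7) are easy to evaluate. The sketch below uses only the standard library; the function names `chi2_moment` and `quasi_parametric_risk_bound` are hypothetical.

```python
import math


def chi2_moment(p: int, r: float) -> float:
    """C(p, r) = E|chi^2_p|^r = 2^r * Gamma(r + p/2) / Gamma(p/2), as in (4.8)."""
    return 2.0 ** r * math.gamma(r + p / 2.0) / math.gamma(p / 2.0)


def quasi_parametric_risk_bound(p: int, r: float, delta: float) -> float:
    """Right-hand side of (4.7): (1 + delta)^r * C(p, r)."""
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta is expected to lie in [0, 1)")
    return (1.0 + delta) ** r * chi2_moment(p, r)


# Sanity checks against the first two moments of a chi-square law:
# E[chi^2_p] = p and E[(chi^2_p)^2] = p * (p + 2).
first_moment = chi2_moment(3, 1)
second_moment = chi2_moment(3, 2)
```

For $\delta = 0$ the bound reduces to the purely parametric constant $C(p,r)$, and the factor $(1+\delta)^r$ quantifies the price of the noise misspecification.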



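4.2 Proof of the bounds for the critical values

Denote for any $l < k$ the variance of the difference $\tilde\theta_k - \tilde\theta_l$ by $V_{lk}$:

$$V_{lk} := \operatorname{Var}(\tilde\theta_k - \tilde\theta_l) \succ 0. \tag{4.9}$$

Then there exists a unique matrix $V_{lk}^{1/2} \succ 0$ such that $(V_{lk}^{1/2})^2 = V_{lk}$.

Lemma 4.5. Assume (A4), (A3) and (A5). If for some $k \le K$ the hypothesis $H_k$ is true, that is, if $\theta_1^* = \dots = \theta_k^* = \theta$, then for any $l < k$ it holds that:

$$\mathbb{P}\{2L(W_l,\tilde\theta_l,\tilde\theta_k) > \mathfrak{z}\} \le \mathbb{P}\{\eta > \mathfrak{z}/\lambda_{\max}(V_{lk}^{1/2}B_lV_{lk}^{1/2})\} \le \mathbb{P}\{\eta > \mathfrak{z}/t_0\},$$

$$\mathbb{P}\{2L(W_k,\tilde\theta_k,\tilde\theta_l) > \mathfrak{z}\} \le \mathbb{P}\{\eta > \mathfrak{z}/\lambda_{\max}(V_{lk}^{1/2}B_kV_{lk}^{1/2})\} \le \mathbb{P}\{\eta > \mathfrak{z}/t_1\},$$

where $t_0 = 2(1+\delta)(1+u_0^{-(k-l)})$, $t_1 = 2(1+\delta)(1+u^{(k-l)})$, and $\eta$ is the $\chi_p^2$-distributed random variable.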

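Proof. The hypothesis $H_k$ and (2.8) imply

$$\tilde\theta_l - \tilde\theta_k = B_l^{-1}\Psi W_l\Sigma_0^{1/2}\varepsilon - B_k^{-1}\Psi W_k\Sigma_0^{1/2}\varepsilon \stackrel{d}{=} V_{lk}^{1/2}\xi,$$

where $\xi$ is a standard normal vector in $\mathbb{R}^p$. Thus by Theorem 4.2 for any $l < k$

$$2L(W_l,\tilde\theta_l,\tilde\theta_k) = \|B_l^{1/2}(\tilde\theta_l - \tilde\theta_k)\|^2 \stackrel{d}{=} \xi^\top V_{lk}^{1/2}B_lV_{lk}^{1/2}\xi.$$

By the Schur theorem there exists an orthogonal matrix $M$ such that

$$\xi^\top V_{lk}^{1/2}B_lV_{lk}^{1/2}\xi = \xi^\top M^\top\Lambda M\xi \stackrel{d}{=} \bar\varepsilon^\top\Lambda\bar\varepsilon,$$

where $\bar\varepsilon$ is a standard normal vector, $\Lambda = \operatorname{diag}\big(\lambda_1(V_{lk}^{1/2}B_lV_{lk}^{1/2}),\dots,\lambda_p(V_{lk}^{1/2}B_lV_{lk}^{1/2})\big)$, and $p = \operatorname{rank}(B_l)$. Therefore,

$$2L(W_l,\tilde\theta_l,\tilde\theta_k) \stackrel{d}{=} \lambda_1(V_{lk}^{1/2}B_lV_{lk}^{1/2})\bar\varepsilon_1^2 + \dots + \lambda_p(V_{lk}^{1/2}B_lV_{lk}^{1/2})\bar\varepsilon_p^2,$$

where $\lambda_j(V_{lk}^{1/2}B_lV_{lk}^{1/2})$, $j = 1,\dots,p$, are the nonzero eigenvalues of $V_{lk}^{1/2}B_lV_{lk}^{1/2}$. By a similar argument:

$$2L(W_k,\tilde\theta_k,\tilde\theta_l) \stackrel{d}{=} \lambda_1(V_{lk}^{1/2}B_kV_{lk}^{1/2})\bar\varepsilon_1^2 + \dots + \lambda_p(V_{lk}^{1/2}B_kV_{lk}^{1/2})\bar\varepsilon_p^2.$$

Denote by $\eta$ the $\chi_p^2$-distributed random variable, then

$$\mathbb{P}\{2L(W_l,\tilde\theta_l,\tilde\theta_k) > \mathfrak{z}\} \le \mathbb{P}\{\eta > \mathfrak{z}/\lambda_{\max}(V_{lk}^{1/2}B_lV_{lk}^{1/2})\},$$

$$\mathbb{P}\{2L(W_k,\tilde\theta_k,\tilde\theta_l) > \mathfrak{z}\} \le \mathbb{P}\{\eta > \mathfrak{z}/\lambda_{\max}(V_{lk}^{1/2}B_kV_{lk}^{1/2})\}.$$

For any square matrices $A$ and $B$ we have $(A-B)(A-B)^\top \preceq 2(AA^\top + BB^\top)$. Application of this bound to the variance of the difference of estimates yields

$$V_{lk} = \big(B_l^{-1}\Psi W_l\Sigma_0^{1/2} - B_k^{-1}\Psi W_k\Sigma_0^{1/2}\big)\big(B_l^{-1}\Psi W_l\Sigma_0^{1/2} - B_k^{-1}\Psi W_k\Sigma_0^{1/2}\big)^\top$$

$$\preceq 2\big(B_l^{-1}\Psi W_l\Sigma_0W_l\Psi^\top B_l^{-1} + B_k^{-1}\Psi W_k\Sigma_0W_k\Psi^\top B_k^{-1}\big) = 2V_l + 2V_k,$$

where $V_l = \operatorname{Var}\tilde\theta_l$, $l \le k$. By (3.2) and by Assumption (A5) we have:

$$V_l \preceq (1+\delta)B_l^{-1},$$

$$V_k \preceq (1+\delta)B_k^{-1} \preceq (1+\delta)u_0^{-(k-l)}B_l^{-1},$$

$$V_{lk} \preceq 2(1+\delta)(1+u_0^{-(k-l)})B_l^{-1}.$$

Therefore,

$$B_l \preceq 2(1+\delta)(1+u_0^{-(k-l)})V_{lk}^{-1}. \tag{4.10}$$

Thus by (4.10) the upper bound for the induced matrix norm reads as follows:

$$\lambda_{\max}(V_{lk}^{1/2}B_lV_{lk}^{1/2}) = \sup_{\|\gamma\|=1}\gamma^\top V_{lk}^{1/2}B_lV_{lk}^{1/2}\gamma \le 2(1+\delta)(1+u_0^{-(k-l)})\sup_{\|\gamma\|=1}\gamma^\top V_{lk}^{1/2}V_{lk}^{-1}V_{lk}^{1/2}\gamma \le 2(1+\delta)(1+u_0^{-(k-l)}). \tag{4.11}$$

Similarly,

$$V_{lk} \preceq 2(1+\delta)(1+u^{(k-l)})B_k^{-1},$$

$$\lambda_{\max}(V_{lk}^{1/2}B_kV_{lk}^{1/2}) \le 2(1+\delta)(1+u^{(k-l)}). \tag{4.12}$$

These bounds imply

$$\mathbb{P}\{2L(W_l,\tilde\theta_l,\tilde\theta_k) > \mathfrak{z}\} \le \mathbb{P}\Big\{\eta > \frac{\mathfrak{z}}{2(1+\delta)(1+u_0^{-(k-l)})}\Big\}, \qquad \mathbb{P}\{2L(W_k,\tilde\theta_k,\tilde\theta_l) > \mathfrak{z}\} \le \mathbb{P}\Big\{\eta > \frac{\mathfrak{z}}{2(1+\delta)(1+u^{(k-l)})}\Big\}.$$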



□ 
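The matrix inequality $(A-B)(A-B)^\top \preceq 2(AA^\top + BB^\top)$ used in the proof follows from the identity $2(AA^\top + BB^\top) - (A-B)(A-B)^\top = (A+B)(A+B)^\top$, whose right-hand side is positive semidefinite. A small numerical check is sketched below; the two matrices and the helper names are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def combine(a, b, sign=1.0):
    """Entrywise a + sign * b."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def gram(a):
    """Return a a^T."""
    return matmul(a, transpose(a))


# Two arbitrary square matrices (illustration only).
a = [[1.0, 2.0], [0.0, -1.0]]
b = [[0.5, -1.0], [3.0, 2.0]]

lhs = gram(combine(a, b, sign=-1.0))                   # (A - B)(A - B)^T
rhs = [[2.0 * v for v in row] for row in combine(gram(a), gram(b))]  # 2(AA^T + BB^T)
gap = combine(rhs, lhs, sign=-1.0)                     # rhs - lhs
gram_sum = gram(combine(a, b))                         # (A + B)(A + B)^T


def quadratic_form(m, v):
    return sum(v[i] * m[i][j] * v[j] for i in range(len(v)) for j in range(len(v)))
```

Since `gap` coincides with `gram_sum`, every quadratic form $\gamma^\top(\text{rhs} - \text{lhs})\gamma$ equals $\|(A+B)^\top\gamma\|^2 \ge 0$.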

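Lemma 4.6. Under the conditions of the preceding lemma for any $\mu_0 < t_0^{-1}$, or $\mu_1 < t_1^{-1}$ respectively, the exponential moments are bounded:

$$\mathbb{E}\exp\{\mu_0L(W_l,\tilde\theta_l,\tilde\theta_k)\} \le [1 - \mu_0t_0]^{-p/2},$$

$$\mathbb{E}\exp\{\mu_1L(W_k,\tilde\theta_k,\tilde\theta_l)\} \le [1 - \mu_1t_1]^{-p/2},$$

where $t_0 = 2(1+\delta)(1+u_0^{-(k-l)})$ and $t_1 = 2(1+\delta)(1+u^{(k-l)})$.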

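Proof. The statement of the lemma is justified similarly to the proof of Corollary 4.4. The bounds (4.11) and (4.12) imply the bounds for the corresponding moment generating functions:

$$\mathbb{E}\exp\{\mu L(W_l,\tilde\theta_l,\tilde\theta_k)\} = \prod_{j=1}^p\mathbb{E}\exp\Big\{\frac{\mu}{2}\lambda_j(V_{lk}^{1/2}B_lV_{lk}^{1/2})\bar\varepsilon_j^2\Big\} \le [1 - \mu\lambda_{\max}(V_{lk}^{1/2}B_lV_{lk}^{1/2})]^{-p/2} \le [1 - 2\mu(1+\delta)(1+u_0^{-(k-l)})]^{-p/2}.$$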



□ 



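Lemma 4.7. Under the conditions of the preceding lemma it holds that:

$$\mathbb{E}|2L(W_l,\tilde\theta_l,\tilde\theta_k)|^r \le 2^rC(p,r)(1+\delta)^r(1+u_0^{-(k-l)})^r,$$

$$\mathbb{E}|2L(W_k,\tilde\theta_k,\tilde\theta_l)|^r \le 2^rC(p,r)(1+\delta)^r(1+u^{(k-l)})^r,$$

where $C(p,r)$ is as in (4.8).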
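Proof. Integrating by parts and Lemma 4.5 yield for the second assertion

$$\mathbb{E}|2L(W_k,\tilde\theta_k,\tilde\theta_l)|^r = r\int_0^\infty\mathbb{P}\{2L(W_k,\tilde\theta_k,\tilde\theta_l) > \mathfrak{z}\}\,\mathfrak{z}^{r-1}d\mathfrak{z} \le r\int_0^\infty\mathbb{P}\Big\{\eta > \frac{\mathfrak{z}}{2(1+\delta)(1+u^{(k-l)})}\Big\}\,\mathfrak{z}^{r-1}d\mathfrak{z} = 2^r(1+\delta)^r(1+u^{(k-l)})^r\,\mathbb{E}|\eta|^r,$$

where $\eta \sim \chi_p^2$. The first assertion is proved similarly.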



□ 

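Proof of Theorem 3.1 (The theoretical choice of the critical values). The risk corresponding to the adaptive estimate can be represented as a sum of risks of the false alarms at each step of the procedure:

$$\mathbb{E}_{0,\Sigma}|(\tilde\theta_k - \hat\theta_k)^\top B_k(\tilde\theta_k - \hat\theta_k)|^r = \sum_{m=1}^{k-1}\mathbb{E}_{0,\Sigma}|(\tilde\theta_k - \tilde\theta_m)^\top B_k(\tilde\theta_k - \tilde\theta_m)|^r\,\mathbb{I}\{\hat\theta_k = \tilde\theta_m\}.$$

By the definition of the last accepted estimate $\hat\theta_k$, for any $m = 1,\dots,k-1$, the event $\{\hat\theta_k = \tilde\theta_m\}$ happens if for some $l = 1,\dots,m$ the statistic $T_{l,m+1} > \mathfrak{z}_l$. Thus

$$\{\hat\theta_k = \tilde\theta_m\} \subseteq \bigcup_{l=1}^m\{T_{l,m+1} > \mathfrak{z}_l\}.$$

It holds also that for any positive $\mu$

$$\mathbb{I}\{T_{l,m+1} > \mathfrak{z}_l\} = \mathbb{I}\{2L(W_l,\tilde\theta_l,\tilde\theta_{m+1}) - \mathfrak{z}_l > 0\} \le \exp\Big\{\frac{\mu}{2}L(W_l,\tilde\theta_l,\tilde\theta_{m+1}) - \frac{\mu}{4}\mathfrak{z}_l\Big\}.$$

This simple fact and the Cauchy-Schwarz inequality imply for $m = 1,\dots,k-1$ the following bound:

$$\mathbb{E}_{0,\Sigma}|(\tilde\theta_k - \tilde\theta_m)^\top B_k(\tilde\theta_k - \tilde\theta_m)|^r\,\mathbb{I}\{\hat\theta_k = \tilde\theta_m\} = \mathbb{E}_{0,\Sigma}|2L(W_k,\tilde\theta_k,\tilde\theta_m)|^r\,\mathbb{I}\{\hat\theta_k = \tilde\theta_m\}$$

$$\le \sum_{l=1}^me^{-\mu\mathfrak{z}_l/4}\,\mathbb{E}_{0,\Sigma}\Big[|2L(W_k,\tilde\theta_k,\tilde\theta_m)|^r\exp\Big\{\frac{\mu}{2}L(W_l,\tilde\theta_l,\tilde\theta_{m+1})\Big\}\Big]$$

$$\le \sum_{l=1}^me^{-\mu\mathfrak{z}_l/4}\big\{\mathbb{E}_{0,\Sigma}\big[|2L(W_k,\tilde\theta_k,\tilde\theta_m)|^{2r}\big]\big\}^{1/2}\big\{\mathbb{E}_{0,\Sigma}\big[\exp\{\mu L(W_l,\tilde\theta_l,\tilde\theta_{m+1})\}\big]\big\}^{1/2}.$$

By Lemma 4.6 with $\delta = 0$

$$\mathbb{E}_{0,\Sigma}\exp\{\mu L(W_l,\tilde\theta_l,\tilde\theta_{m+1})\} \le (1-4\mu)^{-p/2}.$$

This together with the bound from Lemma 4.7 gives

$$\mathbb{E}_{0,\Sigma}|(\tilde\theta_k - \hat\theta_k)^\top B_k(\tilde\theta_k - \hat\theta_k)|^r \le 2^r\sqrt{C(p,2r)}\,(1-4\mu)^{-p/4}\sum_{m=1}^{k-1}\sum_{l=1}^me^{-\mu\mathfrak{z}_l/4}(1+u^{(k-m)})^r$$

$$= 2^r\sqrt{C(p,2r)}\,(1-4\mu)^{-p/4}\sum_{l=1}^{k-1}e^{-\mu\mathfrak{z}_l/4}\sum_{m=l}^{k-1}(1+u^{(k-m)})^r$$

$$\le 2^{2r}\sqrt{C(p,2r)}\,(1-4\mu)^{-p/4}(1-u^{-r})^{-1}\sum_{l=1}^{k-1}e^{-\mu\mathfrak{z}_l/4}\,u^{r(k-l)},$$

because $-(k-l) \le -(m-l)$ and

$$\sum_{m=l}^{k-1}(1+u^{(k-m)})^r \le 2^r\sum_{m=l}^{k-1}u^{r(k-m)} \le 2^ru^{r(k-l)}(1-u^{-r})^{-1}.$$

Since $u^{r(k-l)} \le u^{r(K-l)}$ for any $l < k \le K$ the choice

$$\mathfrak{z}_l = \frac{4}{\mu}\Big\{r(K-l)\log u + \log(K/\alpha) - \frac{p}{4}\log(1-4\mu) - \log(1-u^{-r}) + \bar C(p,r)\Big\}$$

with

$$\bar C(p,r) = \log\Big\{\frac{2^{2r}[\Gamma(2r+p/2)\Gamma(p/2)]^{1/2}}{\Gamma(r+p/2)}\Big\}$$

provides the required bound

$$\mathbb{E}_{0,\Sigma}|(\tilde\theta_l - \hat\theta_l)^\top B_l(\tilde\theta_l - \hat\theta_l)|^r \le \alpha\,C(p,r) \quad\text{for all } l = 2,\dots,K.$$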



□ 
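The theoretical critical values can be computed directly from the closed-form expression in the proof above. The sketch below assumes the formula $\mathfrak{z}_l = \frac{4}{\mu}\{r(K-l)\log u + \log(K/\alpha) - \frac{p}{4}\log(1-4\mu) - \log(1-u^{-r}) + \bar C(p,r)\}$ with $\bar C(p,r) = \log\{2^{2r}[\Gamma(2r+p/2)\Gamma(p/2)]^{1/2}/\Gamma(r+p/2)\}$; the function names and the numerical parameter values are hypothetical, and the helper `false_alarm_risk_bound` merely re-evaluates the last upper bound of the proof for $k = K$.

```python
import math


def chi2_moment(p, r):
    """C(p, r) = 2^r * Gamma(r + p/2) / Gamma(p/2), as in (4.8)."""
    return 2.0 ** r * math.gamma(r + p / 2.0) / math.gamma(p / 2.0)


def log_constant(p, r):
    """log( 2^{2r} * [Gamma(2r + p/2) * Gamma(p/2)]^{1/2} / Gamma(r + p/2) )."""
    return (2.0 * r * math.log(2.0)
            + 0.5 * (math.lgamma(2.0 * r + p / 2.0) + math.lgamma(p / 2.0))
            - math.lgamma(r + p / 2.0))


def critical_values(K, p, r, alpha, u, mu):
    """Return [z_1, ..., z_{K-1}] for the assumed closed-form expression."""
    if not 0.0 < mu < 0.25:
        raise ValueError("mu must lie in (0, 1/4)")
    if u <= 1.0:
        raise ValueError("u must exceed 1")
    base = (math.log(K / alpha) - 0.25 * p * math.log(1.0 - 4.0 * mu)
            - math.log(1.0 - u ** (-r)) + log_constant(p, r))
    return [4.0 / mu * (r * (K - l) * math.log(u) + base) for l in range(1, K)]


def false_alarm_risk_bound(z, K, p, r, u, mu):
    """Last upper bound of the proof, evaluated at k = K."""
    factor = (2.0 ** (2.0 * r) * math.sqrt(chi2_moment(p, 2.0 * r))
              * (1.0 - 4.0 * mu) ** (-p / 4.0) / (1.0 - u ** (-r)))
    total = sum(math.exp(-mu * z[l - 1] / 4.0) * u ** (r * (K - l)) for l in range(1, K))
    return factor * total


# Hypothetical parameter values (illustration only).
K, p, r, alpha, u, mu = 5, 2, 1.0, 0.1, 2.0, 0.1
z = critical_values(K, p, r, alpha, u, mu)
bound = false_alarm_risk_bound(z, K, p, r, u, mu)
```

With these values the bound equals $\alpha\,C(p,r)(K-1)/K$, which stays below the level $\alpha\,C(p,r)$ required by the propagation condition; the critical values decrease in $l$ because the term $r(K-l)\log u$ does.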



4.3 Matrix results 

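Lemma 4.8. The matrices $J_k\otimes\Sigma$ and $J_k\otimes\Sigma_0$ are positive semidefinite for any $k = 2,\dots,K$.

Moreover, under Assumption (A4) with the same $\delta$, a relation similar to (A4) holds for the covariance matrices $\Sigma_k$ and $\Sigma_{k,0}$ of linear estimates:

$$(1-\delta)\Sigma_k \preceq \Sigma_{k,0} \preceq (1+\delta)\Sigma_k, \quad k \le K.$$

Proof. Symmetry of $J_k$ and $\Sigma$ (respectively, $\Sigma_0$) implies symmetry of $J_k\otimes\Sigma$ (respectively, $J_k\otimes\Sigma_0$). Notice that any vector $\gamma_{nk} \in \mathbb{R}^{nk}$ can be represented as a partitioned vector $\gamma_{nk}^\top = \big((\gamma_n^{(1)})^\top, (\gamma_n^{(2)})^\top,\dots,(\gamma_n^{(k)})^\top\big)$, with $\gamma_n^{(l)} \in \mathbb{R}^n$, $l = 1,\dots,k$. Then

$$\gamma_{nk}^\top(J_k\otimes\Sigma)\gamma_{nk} = \Big(\sum_{l=1}^k\gamma_n^{(l)}\Big)^\top\Sigma\Big(\sum_{l=1}^k\gamma_n^{(l)}\Big) = \gamma_n^\top\Sigma\gamma_n, \tag{4.13}$$

where $\gamma_n := \sum_{l=1}^k\gamma_n^{(l)} \in \mathbb{R}^n$. Because $\Sigma \succ 0$ it implies $\gamma_n^\top\Sigma\gamma_n > 0$ for all $\gamma_n \ne 0$. But even for $\gamma_{nk} \ne 0$, if its subvectors $\{\gamma_n^{(l)}\}$ are linearly dependent, $\gamma_n$ can be zero. Thus there exists a nonzero vector $\gamma$ such that $(J_k\otimes\Sigma)\gamma = 0$. This means positive semidefiniteness.

The second assertion follows from the observation that Assumption (A4) due to the equality (4.13) also holds for the Kronecker product:

$$(1-\delta)J_k\otimes\Sigma \preceq J_k\otimes\Sigma_0 \preceq (1+\delta)J_k\otimes\Sigma. \tag{4.14}$$

Therefore

$$(1-\delta)\mathbf{D}_k(J_k\otimes\Sigma)\mathbf{D}_k^\top \preceq \mathbf{D}_k(J_k\otimes\Sigma_0)\mathbf{D}_k^\top \preceq (1+\delta)\mathbf{D}_k(J_k\otimes\Sigma)\mathbf{D}_k^\top.$$

□

Lemma 4.9. Fix $x \in \mathbb{R}^d$. Suppose that the weights $\{w_{l,i}(x)\}$ satisfy

$$w_{l,i}(x)\,w_{m,i}(x) = w_{l,i}(x), \quad l \le m. \tag{4.15}$$

Then under Assumptions (A1), (A2), (A5) the covariance matrix $\Sigma_k$ defined by (3.5) is nonsingular with

$$\det\Sigma_k = \det B_k^{-1}\prod_{l=2}^k\det(B_{l-1}^{-1} - B_l^{-1}) > 0, \quad k = 2,\dots,K.$$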

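Remark 4.2. The condition (4.15) holds for rectangular kernels with nested supports.

Proof. The condition (4.15) implies

$$W_l\Sigma W_m = \operatorname{diag}(w_{l,1}w_{m,1}/\sigma_1^2,\dots,w_{l,n}w_{m,n}/\sigma_n^2) = W_l$$

for any $l \le m$. Thus the blocks of $\Sigma_k$ simplify to

$$D_l\Sigma D_m^\top = B_l^{-1}\Psi W_l\Sigma W_m\Psi^\top B_m^{-1} = B_l^{-1}\Psi W_l\Psi^\top B_m^{-1} = B_m^{-1} \tag{4.16}$$

and $\Sigma_k$ has a simple structure:

$$\Sigma_k = \begin{pmatrix} B_1^{-1} & B_2^{-1} & B_3^{-1} & \cdots & B_k^{-1} \\ B_2^{-1} & B_2^{-1} & B_3^{-1} & \cdots & B_k^{-1} \\ B_3^{-1} & B_3^{-1} & B_3^{-1} & \cdots & B_k^{-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ B_k^{-1} & B_k^{-1} & B_k^{-1} & \cdots & B_k^{-1} \end{pmatrix}.$$

Then the determinant of $\Sigma_k$ coincides with the determinant of the following irreducible block triangular matrix:

$$\det\Sigma_k = \det\begin{pmatrix} B_1^{-1}-B_2^{-1} & 0 & \cdots & 0 & 0 \\ B_2^{-1}-B_3^{-1} & B_2^{-1}-B_3^{-1} & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ B_{k-1}^{-1}-B_k^{-1} & B_{k-1}^{-1}-B_k^{-1} & \cdots & B_{k-1}^{-1}-B_k^{-1} & 0 \\ B_k^{-1} & B_k^{-1} & \cdots & B_k^{-1} & B_k^{-1} \end{pmatrix}$$

implying

$$\det\Sigma_k = \det(B_1^{-1}-B_2^{-1})\det(B_2^{-1}-B_3^{-1})\cdots\det(B_{k-1}^{-1}-B_k^{-1})\det B_k^{-1}.$$

Clearly the matrix $\Sigma_k$ is nonsingular if all the matrices $B_{l-1}^{-1} - B_l^{-1}$ are nonsingular. By (A1) and (A2) $B_l \succ 0$ for any $l$. By (A5) there exists $u_0 > 1$ such that $B_l \succeq u_0B_{l-1}$, therefore

$$B_{l-1}^{-1} - B_l^{-1} \succeq (1 - 1/u_0)B_{l-1}^{-1} \succ 0. \qquad □$$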
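The determinant formula of Lemma 4.9 can be checked in the scalar case $p = 1$, where each block $B_l^{-1}$ is a number and the $(l,m)$ entry of $\Sigma_k$ equals $B_{\max(l,m)}^{-1}$. The sketch below is only an illustration under this assumption; the values of $B_l^{-1}$ and the function names are hypothetical.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [list(row) for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(m[row][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for j in range(col, n):
                m[row][j] -= factor * m[col][j]
    return det


def nested_covariance(b_inv):
    """Matrix with (l, m) entry b_inv[max(l, m)], the scalar analog of Sigma_k."""
    k = len(b_inv)
    return [[b_inv[max(l, m)] for m in range(k)] for l in range(k)]


def product_formula(b_inv):
    """det B_k^{-1} * prod_{l=2}^k (B_{l-1}^{-1} - B_l^{-1}) in the scalar case."""
    result = b_inv[-1]
    for l in range(1, len(b_inv)):
        result *= b_inv[l - 1] - b_inv[l]
    return result


# Hypothetical decreasing values B_1^{-1} > B_2^{-1} > B_3^{-1}, as implied by (A5).
b_inv = [1.0, 0.5, 0.25]
det_direct = determinant(nested_covariance(b_inv))
det_formula = product_formula(b_inv)
```

Both numbers equal $0.25 \cdot 0.5 \cdot 0.25 = 0.03125$, and the product is positive precisely because the sequence $B_l^{-1}$ is strictly decreasing.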

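Lemma 4.10. In the "nonparametric situation" the moment generating function (mgf) of the joint distribution of $\tilde\theta_1,\dots,\tilde\theta_K$ is

$$\mathbb{E}\exp\{\gamma^\top(\operatorname{vec}\tilde\Theta_K - \operatorname{vec}\Theta_K^*)\} = \exp\Big\{\frac12\gamma^\top\Sigma_{K,0}\gamma\Big\}. \tag{4.17}$$

Thus, provided that $\Sigma_{K,0} \succ 0$, it holds that $\operatorname{vec}\tilde\Theta_K \sim \mathcal{N}(\operatorname{vec}\Theta_K^*, \Sigma_{K,0})$.

Similarly, in the "parametric situation", if $\Sigma_K \succ 0$, then the joint distribution of $\operatorname{vec}\tilde\Theta_K$ is $\mathcal{N}(\operatorname{vec}\Theta_K, \Sigma_K)$ with the mgf:

$$\mathbb{E}\exp\{\gamma^\top(\operatorname{vec}\tilde\Theta_K - \operatorname{vec}\Theta_K)\} = \exp\Big\{\frac12\gamma^\top\Sigma_K\gamma\Big\}. \tag{4.18}$$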



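Proof. Let $\gamma \in \mathbb{R}^{pK}$ be written in a partitioned form $\gamma^\top = (\gamma_1^\top,\dots,\gamma_K^\top)$ with $\gamma_l \in \mathbb{R}^p$, $l = 1,\dots,K$. Then the mgf for the centered random vector $\operatorname{vec}\tilde\Theta_K - \operatorname{vec}\Theta_K^* \in \mathbb{R}^{pK}$, due to the decomposition (2.8) $\tilde\theta_l = \theta_l^* + D_l\Sigma_0^{1/2}\varepsilon$ with $D_l = B_l^{-1}\Psi W_l$, can be represented as follows:

$$\mathbb{E}\exp\{\gamma^\top(\operatorname{vec}\tilde\Theta_K - \operatorname{vec}\Theta_K^*)\} = \mathbb{E}\exp\Big\{\sum_{l=1}^K\gamma_l^\top(\tilde\theta_l - \theta_l^*)\Big\} = \mathbb{E}\exp\Big\{\sum_{l=1}^K\gamma_l^\top D_l\Sigma_0^{1/2}\varepsilon\Big\} = \mathbb{E}\exp\Big\{\Big(\sum_{l=1}^KD_l^\top\gamma_l\Big)^\top\Sigma_0^{1/2}\varepsilon\Big\}.$$

The trivial observation that $\sum_{l=1}^KD_l^\top\gamma_l$ is a vector in $\mathbb{R}^n$ and $\Sigma_0^{1/2}\varepsilon \sim \mathcal{N}(0,\Sigma_0)$ implies, by the definition of $\Sigma_{K,0}$, the first assertion of the lemma, because

$$\mathbb{E}\exp\Big\{\Big(\sum_{l=1}^KD_l^\top\gamma_l\Big)^\top\Sigma_0^{1/2}\varepsilon\Big\} = \exp\Big\{\frac12\Big(\sum_{l=1}^KD_l^\top\gamma_l\Big)^\top\Sigma_0\Big(\sum_{l=1}^KD_l^\top\gamma_l\Big)\Big\} = \exp\Big\{\frac12(\mathbf{D}_K^\top\gamma)^\top(J_K\otimes\Sigma_0)\mathbf{D}_K^\top\gamma\Big\} = \exp\Big\{\frac12\gamma^\top\Sigma_{K,0}\gamma\Big\},$$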

here is defined by □ 

4.4 Proof of the propagation property 

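Lemma 4.11. The Kullback-Leibler divergence between the distributions of $\operatorname{vec}\tilde\Theta_k$ under the alternative and under the null has the following form:

$$2KL(\mathbb{P}^k_{f,\Sigma_0},\mathbb{P}^k_{\theta,\Sigma}) := 2\,\mathbb{E}_{f,\Sigma_0}\log\Big(\frac{d\mathbb{P}^k_{f,\Sigma_0}}{d\mathbb{P}^k_{\theta,\Sigma}}\Big) = \Delta(k) + \log\Big(\frac{\det\Sigma_k}{\det\Sigma_{k,0}}\Big) + \operatorname{tr}(\Sigma_k^{-1}\Sigma_{k,0}) - pk, \tag{4.19}$$

where

$$b(k) := \operatorname{vec}\Theta_k^* - \operatorname{vec}\Theta_k, \tag{4.20}$$

$$\Delta(k) := b(k)^\top\Sigma_k^{-1}b(k). \tag{4.21}$$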

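Proof. Denote the Radon-Nikodym derivative by $Z_k := d\mathbb{P}^k_{f,\Sigma_0}/d\mathbb{P}^k_{\theta,\Sigma}$. Then

$$\log(Z_k(y)) = \frac12\log\Big(\frac{\det\Sigma_k}{\det\Sigma_{k,0}}\Big) - \frac12\|\Sigma_{k,0}^{-1/2}(y - \operatorname{vec}\Theta_k^*)\|^2 + \frac12\|\Sigma_k^{-1/2}(y - \operatorname{vec}\Theta_k)\|^2 \tag{4.22}$$

can be considered as a quadratic function of $\operatorname{vec}\Theta_k$. By the Taylor expansion at the point $\operatorname{vec}\Theta_k^*$ the last expression reads as follows

$$\log(Z_k(y)) = \frac12\log\Big(\frac{\det\Sigma_k}{\det\Sigma_{k,0}}\Big) - \frac12\|\Sigma_{k,0}^{-1/2}(y - \operatorname{vec}\Theta_k^*)\|^2 + \frac12\|\Sigma_k^{-1/2}(y - \operatorname{vec}\Theta_k^*)\|^2 + b(k)^\top\Sigma_k^{-1}(y - \operatorname{vec}\Theta_k^*) + \frac12\Delta(k).$$

Then the expression for the Kullback-Leibler divergence can be written in the following way:

$$KL(\mathbb{P}^k_{f,\Sigma_0},\mathbb{P}^k_{\theta,\Sigma}) := \mathbb{E}_{f,\Sigma_0}\log(Z_k) = \frac12\log\Big(\frac{\det\Sigma_k}{\det\Sigma_{k,0}}\Big) + \frac12\Delta(k) + \frac12\mathbb{E}\big\{\|\Sigma_k^{-1/2}\Sigma_{k,0}^{1/2}\xi\|^2 - \|\xi\|^2\big\},$$

where $\xi \sim \mathcal{N}(0, I_{pk})$. This implies

$$2KL(\mathbb{P}^k_{f,\Sigma_0},\mathbb{P}^k_{\theta,\Sigma}) = \Delta(k) + \log\Big(\frac{\det\Sigma_k}{\det\Sigma_{k,0}}\Big) + \operatorname{tr}(\Sigma_k^{-1}\Sigma_{k,0}) - pk. \tag{4.23}$$

In the case of homogeneous errors with $\sigma_{0,i} = \sigma_0$ and $\sigma_i = \sigma$, $i = 1,\dots,n$, the calculations simplify a lot. Now

$$\Sigma_k = \sigma^2V_k, \qquad \Sigma_{k,0} = \sigma_0^2V_k$$

with a $pk\times pk$ matrix $V_k$ defined as

$$V_k = (\mathcal{D}_1\oplus\dots\oplus\mathcal{D}_k)(J_k\otimes I_n)(\mathcal{D}_1\oplus\dots\oplus\mathcal{D}_k)^\top,$$

where $\mathcal{D}_l = (\Psi W_l\Psi^\top)^{-1}\Psi W_l$, $l = 1,\dots,k$, does not depend on $\sigma$. Then $\Delta(k) = \sigma^{-2}\Delta_1(k)$ with $\Delta_1(k) := b(k)^\top V_k^{-1}b(k)$, $\det\Sigma_k/\det\Sigma_{k,0} = (\sigma^2/\sigma_0^2)^{pk}$, and the expression for the Kullback-Leibler divergence reads as follows:

$$KL(\mathbb{P}^k_{f,\Sigma_0},\mathbb{P}^k_{\theta,\Sigma}) = pk\log\Big(\frac{\sigma}{\sigma_0}\Big) + \frac{1}{2\sigma^2}\Delta_1(k) + \frac{pk}{2}\Big(\frac{\sigma_0^2}{\sigma^2} - 1\Big) \tag{4.24}$$

implying the same asymptotic behavior as in (3.13).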

□ 
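The homogeneous-noise expression for the Kullback-Leibler divergence can be cross-checked against the general formula (4.19). The sketch below assumes $\Sigma_k = \sigma^2V_k$ and $\Sigma_{k,0} = \sigma_0^2V_k$, so that the determinant ratio is $(\sigma^2/\sigma_0^2)^{pk}$ and the trace term is $pk\,\sigma_0^2/\sigma^2$; the function names and the numbers are hypothetical, and `delta1` stands for $\Delta_1(k) = b(k)^\top V_k^{-1}b(k)$.

```python
import math


def kl_homogeneous(pk, sigma, sigma0, delta1):
    """Closed form for homogeneous errors: pk*log(sigma/sigma0) + delta1/(2 sigma^2)
    + (pk/2) * (sigma0^2/sigma^2 - 1)."""
    return (pk * math.log(sigma / sigma0)
            + delta1 / (2.0 * sigma ** 2)
            + 0.5 * pk * (sigma0 ** 2 / sigma ** 2 - 1.0))


def kl_from_general_formula(pk, sigma, sigma0, delta1):
    """Half of the right-hand side of (4.19), specialised to Sigma_k = sigma^2 V_k
    and Sigma_{k,0} = sigma0^2 V_k."""
    modeling_bias = delta1 / sigma ** 2                      # Delta(k)
    log_det_ratio = pk * math.log(sigma ** 2 / sigma0 ** 2)  # log(det Sigma_k / det Sigma_{k,0})
    trace_term = pk * sigma0 ** 2 / sigma ** 2               # tr(Sigma_k^{-1} Sigma_{k,0})
    return 0.5 * (modeling_bias + log_det_ratio + trace_term - pk)


# Hypothetical values (illustration only).
kl_closed = kl_homogeneous(pk=6, sigma=1.2, sigma0=1.0, delta1=0.8)
kl_general = kl_from_general_formula(pk=6, sigma=1.2, sigma0=1.0, delta1=0.8)
```

When $\sigma = \sigma_0$ only the modeling-bias term $\Delta_1(k)/(2\sigma^2)$ survives, so the divergence then measures the bias alone.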
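Proof of Theorem 3.2 (Propagation property)

Notice that for any nonnegative measurable function $g = g(\tilde\Theta_k)$ the Cauchy-Schwarz inequality implies

$$\mathbb{E}_{f,\Sigma_0}[g] = \mathbb{E}_{\theta,\Sigma}[gZ_k] \le \big(\mathbb{E}_{\theta,\Sigma}[g^2]\big)^{1/2}\big(\mathbb{E}_{\theta,\Sigma}[Z_k^2]\big)^{1/2} \tag{4.25}$$

with the Radon-Nikodym derivative $Z_k = d\mathbb{P}^k_{f,\Sigma_0}/d\mathbb{P}^k_{\theta,\Sigma}$. One gets the first assertion taking $g = |(\tilde\theta_k - \theta)^\top B_k(\tilde\theta_k - \theta)|^{r/2}$, and applying "the parametric risk bound" with $\delta = 0$ from (4.7):

$$\mathbb{E}_{f,\Sigma_0}[g] \le \big(\mathbb{E}_{\theta,\Sigma}|2L(W_k,\tilde\theta_k,\theta)|^r\big)^{1/2}\big(\mathbb{E}_{\theta,\Sigma}[Z_k^2]\big)^{1/2} \le \big(\mathbb{E}|\chi_p^2|^r\big)^{1/2}\big(\mathbb{E}_{\theta,\Sigma}[Z_k^2]\big)^{1/2}.$$

The second assertion is treated similarly by application of the pivotality property from Lemma 4.1 and the propagation conditions (2.17).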

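To calculate $\mathbb{E}_{\theta,\Sigma}[Z_k^2]$ let us consider $\log Z_k$ given by

$$\log Z_k(y) = \frac12\log\Big(\frac{\det\Sigma_k}{\det\Sigma_{k,0}}\Big) - \frac12\|\Sigma_{k,0}^{-1/2}(y - \operatorname{vec}\Theta_k^*)\|^2 + \frac12\|\Sigma_k^{-1/2}(y - \operatorname{vec}\Theta_k)\|^2$$

as a function of $\operatorname{vec}\Theta_k^*$. Application of the Taylor expansion at the point $\operatorname{vec}\Theta_k$ yields

$$2\log Z_k = \log\frac{\det\Sigma_k}{\det\Sigma_{k,0}} - \|\Sigma_{k,0}^{-1/2}(y - \operatorname{vec}\Theta_k)\|^2 + \|\Sigma_k^{-1/2}(y - \operatorname{vec}\Theta_k)\|^2 + 2b(k)^\top\Sigma_{k,0}^{-1}(y - \operatorname{vec}\Theta_k) - b(k)^\top\Sigma_{k,0}^{-1}b(k).$$

With $\xi \sim \mathcal{N}(0, I_{pk})$ the second moment of the Radon-Nikodym derivative reads as follows

$$\mathbb{E}_{\theta,\Sigma}[Z_k^2] = \frac{\det\Sigma_k}{\det\Sigma_{k,0}}\exp\{-b(k)^\top\Sigma_{k,0}^{-1}b(k)\}\,\mathbb{E}\exp\big\{-\|\Sigma_{k,0}^{-1/2}\Sigma_k^{1/2}\xi\|^2 + \|\xi\|^2 + 2b(k)^\top\Sigma_{k,0}^{-1}\Sigma_k^{1/2}\xi\big\}$$

$$= \frac{\det\Sigma_k}{\det\Sigma_{k,0}}\exp\big\{2b(k)^\top\Sigma_{k,0}^{-1}\Sigma_k^{1/2}\big(2\Sigma_k^{1/2}\Sigma_{k,0}^{-1}\Sigma_k^{1/2} - I_{pk}\big)^{-1}\Sigma_k^{1/2}\Sigma_{k,0}^{-1}b(k) - b(k)^\top\Sigma_{k,0}^{-1}b(k)\big\}\Big[\prod_{j=1}^{pk}\big\{2\lambda_j(\Sigma_k^{1/2}\Sigma_{k,0}^{-1}\Sigma_k^{1/2}) - 1\big\}\Big]^{-1/2}. \tag{4.26}$$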



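To estimate the obtained expression in terms of the level of noise misspecification $\delta$ notice that the condition (3.8) implies

$$\Big(\frac{1}{1+\delta}\Big)^{pk} \le \frac{\det\Sigma_k}{\det\Sigma_{k,0}} \le \Big(\frac{1}{1-\delta}\Big)^{pk},$$

$$\Big(\frac{1-\delta}{1+\delta}\Big)^{pk/2} \le \Big[\prod_{j=1}^{pk}\big\{2\lambda_j(\Sigma_k^{1/2}\Sigma_{k,0}^{-1}\Sigma_k^{1/2}) - 1\big\}\Big]^{-1/2} \le \Big(\frac{1+\delta}{1-\delta}\Big)^{pk/2}.$$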

Therefore the quantity in the exponent in (j4.26p is bounded by: 



^ ^ l)6(fc)^S->(fc) 



(l + <5)2 



Moreover, 



Finally, 



l + S 1 + S ^ ' ^ ^ 
< 6(fc)TS->(fc) 



+ lV(l + '5)^ J l + S 



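In the case of homogeneous errors the expression for $\log Z_k$ reads as

$$\log Z_k = pk\log\Big(\frac{\sigma}{\sigma_0}\Big) + \frac12\Big(\frac{1}{\sigma^2} - \frac{1}{\sigma_0^2}\Big)\|V_k^{-1/2}(y - \operatorname{vec}\Theta_k)\|^2 + \frac{1}{\sigma_0^2}b(k)^\top V_k^{-1}(y - \operatorname{vec}\Theta_k) - \frac{1}{2\sigma_0^2}b(k)^\top V_k^{-1}b(k),$$

implying

$$\mathbb{E}_{\theta,\Sigma}[Z_k^2] = \Big(\frac{\sigma^2}{\sigma_0^2}\Big)^{pk}\Big(\frac{\sigma_0^2}{2\sigma^2 - \sigma_0^2}\Big)^{pk/2}\exp\Big\{\frac{b(k)^\top V_k^{-1}b(k)}{2\sigma^2 - \sigma_0^2}\Big\}.$$

By Assumption (A4)

$$\Big(\frac{1-\delta}{(1+\delta)^3}\Big)^{pk/2}\exp\Big\{\frac{\Delta_1(k)}{\sigma^2(1+\delta)}\Big\} \le \mathbb{E}_{\theta,\Sigma}[Z_k^2] \le \Big(\frac{1+\delta}{(1-\delta)^3}\Big)^{pk/2}\exp\Big\{\frac{\Delta_1(k)}{\sigma^2(1-\delta)}\Big\}, \tag{4.28}$$

where $p$ is the dimension of the parameter set and $k$ is the degree of the localization. □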
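The two-sided bound (4.28) can be illustrated numerically. The sketch below assumes the homogeneous-noise closed form $\mathbb{E}_{\theta,\Sigma}[Z_k^2] = (\sigma^2/\sigma_0^2)^{pk}(\sigma_0^2/(2\sigma^2-\sigma_0^2))^{pk/2}\exp\{\Delta_1(k)/(2\sigma^2-\sigma_0^2)\}$ and the condition $(1-\delta)\sigma^2 \le \sigma_0^2 \le (1+\delta)\sigma^2$; the function names and the numerical values are hypothetical.

```python
import math


def second_moment_z(pk, sigma2, sigma0_2, delta1):
    """Closed form of E[Z_k^2] under homogeneous errors (variances passed squared)."""
    gap = 2.0 * sigma2 - sigma0_2
    if gap <= 0.0:
        raise ValueError("requires 2 * sigma^2 > sigma_0^2")
    return ((sigma2 / sigma0_2) ** pk
            * (sigma0_2 / gap) ** (pk / 2.0)
            * math.exp(delta1 / gap))


def second_moment_bounds(pk, sigma2, delta, delta1):
    """Lower and upper bounds from (4.28) for a misspecification level delta in [0, 1)."""
    lower = (((1.0 - delta) / (1.0 + delta) ** 3) ** (pk / 2.0)
             * math.exp(delta1 / (sigma2 * (1.0 + delta))))
    upper = (((1.0 + delta) / (1.0 - delta) ** 3) ** (pk / 2.0)
             * math.exp(delta1 / (sigma2 * (1.0 - delta))))
    return lower, upper


# Hypothetical setting: assumed variance 1, true variance 1.05, delta = 0.1.
pk, sigma2, sigma0_2, delta, delta1 = 4, 1.0, 1.05, 0.1, 2.0
moment = second_moment_z(pk, sigma2, sigma0_2, delta1)
lower, upper = second_moment_bounds(pk, sigma2, delta, delta1)
```

In the correctly specified case $\sigma_0 = \sigma$ and $\delta = 0$ the two bounds coincide with $\exp\{\Delta_1(k)/\sigma^2\}$, so the whole inflation of the risk then comes from the modeling bias.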